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Abstract 

Compressed sensing (CS) is an emerging field that has attracted considerable research interest over the past few 
years. Previous review articles in CS limit their scope to standard discrete-to-discrete measurement architectures 
using matrices of randomized nature and signal models based on standard sparsity. In recent years, CS has worked 
its way into several new application areas. This, in turn, necessitates a fresh look on many of the basics of CS. The 
random matrix measurement operator must be replaced by more structured sensing architectures that correspond to 
the characteristics of feasible acquisition hardware. The standard sparsity prior has to be extended to include a much 
richer class of signals and to encode broader data models, including continuous-time signals. In our overview, the 
theme is exploiting signal and measurement structure in compressive sensing. The prime focus is bridging theory 
and practice; that is, to pinpoint the potential of structured CS strategies to emerge from the math to the hardware. 
Our summary highlights new directions as well as relations to more traditional CS, with the hope of serving both 
as a review to practitioners wanting to join this emerging field, and as a reference for researchers that attempts to 
put some of the existing ideas in perspective of practical applications. 

I. Introduction and Motivation 

Compressed sensing (CS) is an emerging field that has attracted considerable research interest in the signal 
processing community. Since its introduction only several years ago [1,2], thousands of papers have appeared 
in this area, and hundreds of conferences, workshops, and special sessions have been dedicated to this growing 
research field. 

Due to the vast interest in this topic, there exist several excellent review articles on the basics of CS [3-5]. These 
articles focused on the first CS efforts: the use of standard discrete-to-discrete measurement architectures using 
matrices of randomized nature, where no structure beyond sparsity is assumed on the signal or in its representation. 
This basic formulation already required the use of sophisticated mathematical tools and rich theory in order to 
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analyze recovery approaches and provide performance guarantees. It was therefore essential to confine attention 
to this simplified setting in the early stages of development of the CS framework. 

Essentially all analog-to-digital converters (ADCs) to date follow the celebrated Shannon-Nyquist theorem which 
requires the sampling rate to be at least twice the bandwidth of the signal. This basic principle underlies the 
majority of digital signal processing (DSP) applications such as audio, video, radio receivers, radar applications, 
medical devices and more. The ever growing demand for data, as well as advances in radio frequency (RF) 
technology, have promoted the use of high-bandwidth signals, for which the rates dictated by the Shannon- 
Nyquist theorem impose severe challenges both on the acquisition hardware and on the subsequent storage and 
DSP processors. CS was motivated in part by the desire to sample wideband signals at rates far lower than 
the Shannon-Nyquist rate, while still maintaining the essential information encoded in the underlying signal. In 
practice, however, much of the work to date on CS has focused on acquiring finite-dimensional sparse vectors 
using random measurements. This precludes the important case of continuous-time (i.e., analog) input signals, as 
well as practical hardware architectures which inevitably are structured. Achieving the holy grail of compressive 
ADCs and increased resolution requires a broader framework which can treat more general signal models, including 
continuous-time signals with various types of structure, as well as practical measurement schemes. 

In recent years, the area of CS has branched out to many new fronts and has worked its way into several 
application areas. This, in turn, necessitates a fresh look on many of the basics of CS. The random matrix mea- 
surement operator, fundamental in all early presentations of CS, must be replaced by more structured measurement 
operators that correspond to the application of interest, such as wireless channels, analog sampling hardware, sensor 
networks and optical imaging. The standard sparsity prior that has characterized early work in CS has to be extended 
to include a much richer class of signals: signals that have underlying low-dimensional structure, not necessarily 
represented by standard sparsity, and signals that can have arbitrary dimensions, not only finite-dimensional vectors. 

A significant part of the recent work on CS from the signal processing community can be classified into two 
major contribution areas. The first group consists of theory and applications related to CS matrices that are not 
completely random and that often exhibit considerable structure. This largely follows from efforts to model the 
way the samples are acquired in practice, which leads to sensing matrices that inherent their structure from the real 
world. The second group includes signal representations that exhibit structure beyond sparsity and broader classes 
of signals, such as continuous-time signals with infinite-dimensional representations. For many types of signals, 
such structure allows for a higher degree of signal compression when leveraged on top of sparsity. Additionally, 
infinite-dimensional signal representations provide an important example of richer structure which clearly cannot 
be described using standard sparsity. Since reducing the sampling rate in analog signals was one of the driving 
forces behind CS, building a theory that can accommodate low-dimensional signals in arbitrary Hilbert spaces is 
clearly an essential part of the CS framework. Both of these categories are motivated by real-world CS involving 
actual hardware implementations. 

In our review, the theme is exploiting signal and measurement structure in CS. The prime focus is bridging 
theory and practice - that is, to pinpoint the potential of CS strategies to emerge from the math to the hardware 
by generalizing the underlying theory where needed. We believe that this is an essential ingredient in taking CS 
to the next step: incorporating this fast growing field into real-world applications. Considerable efforts have been 
devoted in recent years by many researchers to adapt the theory of CS to better solve real-world signal acquisition 
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challenges. This has also led to parallel low-rate sampling schemes that combine the principles of CS with the 
rich theory of sampling such as the finite rate of innovation (FRI) [6-8] and Xampling frameworks [9, 10]. There 
are already dozens of papers dealing with these broad ideas. In this review we have strived to provide a coherent 
summary, highlighting new directions and relations to more traditional CS. This material can both serve as a review 
to those wanting to join this emerging field, as well as a reference that attempts to summarize some of the existing 
results in the framework of practical applications. Our hope is that this presentation will attract the interest of both 
mathematicians and engineers in the desire to promote the CS premise into practical applications, and encourage 
further research into this new frontier. 

This review paper is organized as follows. Section II provides background motivating the formulation of CS and 
the layout of the review. A primer on standard CS theory is presented in Section III. This material serves as a basis 
for the later developments. Section IV reviews alternative constructions for structured CS matrices beyond those 
generated completely at random. In Section V we introduce finite-dimensional signal models that exhibit additional 
signal structure. This leads to the more general union-of-subspaces framework, which will play an important role 
in the context of structured infinite-dimensional representations as well. Section VI extends the concepts of CS to 
infinite-dimensional signal models and introduces recent compressive ADCs which have been developed based on 
the Xampling and FRI frameworks. For each of the matrices and models introduced, we summarize the details of 
the theoretical and algorithmic frameworks and provide example applications where the structure is applicable. 

II. Background 

We live in a digital world. Telecommunication, entertainment, medical devices, gadgets, business - all revolve 
around digital media. Miniature sophisticated black-boxes process streams of bits accurately at high speeds. 
Nowadays, electronic consumers feel natural that a media player shows their favorite movie, or that their surround 
system synthesizes pure acoustics, as if sitting in the orchestra instead of the living room. The digital world plays 
a fundamental role in our everyday routine, to such a point that we almost forget that we cannot "hear" or "watch" 
these streams of bits, running behind the scenes. 

Analog to digital conversion lies at the heart of this revolution. ADC devices translate the physical information 
into a stream of numbers, enabling digital processing by sophisticated software algorithms. The ADC task is 
inherently intricate: its hardware must hold a snapshot of a fast-varying input signal steady while acquiring 
measurements. Since these measurements are spaced in time, the values between consecutive snapshots are lost. In 
general, therefore, there is no way to recover the analog signal unless some prior on its structure is incorporated. 

After sampling, the numbers or bits retained must be stored and later processed. This requires ample storage 
devices and sufficient processing power. As technology advances, so does the requirement for ever-increasing 
amounts of data, imposing unprecedented strains on both the ADC devices and the subsequent DSP and storage 
media. How then does consumer electronics keep up with these high demands? Fortunately, most of the data we 
acquire can be discarded without much perceptual loss. This is evident in essentially all compression techniques 
used to date. However, this paradigm of high-rate sampling followed by compression does not alleviate the large 
strains on the acquisition device and on the DSR In his seminal work on CS [1], Donoho posed the ultimate goal 
of merging compression and sampling: "why go to so much effort to acquire all the data when most of what we 
get will be thrown away? Can't we just directly measure the part that won't end up being thrown away?". 
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A. Shannon-Nyquist Theorem 

ADCs provide the interface between an analog signal being recorded and a suitable discrete representation. A 
common approach in engineering is to assume that the signal is bandlimited, meaning that the spectral contents 
are confined to a maximal frequency B. Bandlimited signals have limited time variation, and can therefore be 
perfectly reconstructed from equispaced samples with rate at least 2B, termed the Nyquist rate. This fundamental 
result is often attributed in the engineering community to Shannon-Nyquist [11,12], although it dates back to 
earlier works by Whittaker [13] and Kotelriikov [14]. 

Theorem 1. [12] If a function x(t) contains no frequencies higher than B hertz, then it is completely determined 
by giving its ordinates at a series of points spaced 1/(2B) seconds apart. 

A fundamental reason for processing at the Nyquist rate is the clear relation between the spectrum of x(t) and 
that of its samples x(nT), so that digital operations can be easily substituted for their analog counterparts. Digital 
filtering is an example where this relation is successfully exploited. Since the power spectral densities of analog 
and discrete random processes are associated in a similar manner, estimation and detection of parameters of analog 
signals can be performed by DSP. In contrast, compression is carried out by a series of algorithmic steps, which, 
in general, exhibit a nonlinear complicated relationship between the samples x(nT) and the stored data. 

While this framework has driven the development of signal acquisition devices for the last half century, the 
increasing complexity of emerging applications dictates increasingly higher sampling rates that cannot always be 
met using available hardware. Advances in related fields such as wideband communication and RF technology 
open a considerable gap with ADC devices. Conversion speeds which are twice the signal's maximal frequency 
component have become more and more difficult to obtain. Consequently, alternatives to high rate sampling are 
drawing considerable attention in both academia and industry. 

Structured analog signals can often be processed far more efficiently than what is dictated by the Shannon- 
Nyquist theorem, which does not take any structure into account. For example, many wideband communication 
signals are comprised of several narrow transmissions modulated at high carrier frequencies. A common practice in 
engineering is demodulation in which the input signal is multiplied by the carrier frequency of a band of interest, 
in order to shift the contents of the narrowband transmission from the high frequencies to the origin. Then, 
commercial ADC devices at low rates are utilized. Demodulation, however, requires knowing the exact carrier 
frequency. In this review we focus on structured models in which the exact parameters defining the structure are 
unknown. In the context of multiband communications, for example, the carrier frequencies may not be known, or 
may be changing over time. The goal then is to build a compressed sampler which does not depend on the carrier 
frequencies, but can nonetheless acquire and process such signals at rates below Nyquist. 

B. Compressed Sensing and Beyond 

A holy grail of CS is to build acquisition devices that exploit signal structure in order to reduce the sampling 
rate, and subsequent demands on storage and DSP. In such an approach, the actual information contents dictate 
the sampling rate, rather than the dimensions of the ambient space in which the signal resides. The challenges in 
achieving this task both theoretically and in terms of hardware design can be reduced substantially when considering 
finite-dimensional problems in which the signal to be measured can be represented as a discrete finite-length vector. 
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This has spurred a surge of research on various mathematical and algorithmic aspects of sensing sparse signals, 
which were mainly studied for discrete finite vectors. 

At its core, CS is a mathematical framework that studies accurate recovery of a signal represented by a vector 
of length N from M <C N measurements, effectively performing compression during signal acquisition. The 
measurement paradigm consists of linear projections, or inner products, of the signal vector into a set of carefully 
chosen projection vectors that act as a multitude of probes on the information contained in the signal. In the 
first part of this review (Sections III and IV) we survey the fundamentals of CS and show how the ideas can be 
extended to allow for more elaborate measurement schemes that incorporate structure into the measurement process. 
When considering real-world acquisition schemes, the choices of possible measurement matrices are dictated by 
the constraints of the application. Thus, we must deviate from the general randomized constructions and apply 
structure within the projection vectors that can be easily implemented by the acquisition hardware. Section IV 
focuses on such alternatives; we survey both existing theory and applications for several classes of structured CS 
matrices. In certain applications, there exist hardware designs that measure analog signals at a sub-Nyquist rate, 
obtaining measurements for finite-dimensional signal representations via such structured CS matrices. 

In the second part of this review (Sections V and VI) we expand the theory of CS to signal models tailored to 
express structure beyond standard sparsity. A recent emerging theoretical framework that allows a broader class 
of signal models to be acquired efficiently is the union of subspaces model [15-20]. We introduce this framework 
and some of its applications in a finite-dimensional context in Section V, which include more general notions of 
structure and sparsity. Combining the principles and insights from the previous sections, in Section VI we extend 
the notion of CS to analog signals with infinite-dimensional representations. This new framework, referred to as 
Xampling [9, 10], relies on more general signal models - union of subspaces and FRI signals - together with 
guidelines on how to exploit these mathematical structures in order to build sensing devices that can directly 
acquire analog signals at reduced rates. We then survey several compressive ADCs that result from this broader 
framework. 

III. Compressed Sensing Basics 

Compressed sensing (CS) [1-5] offers a framework for simultaneous sensing and compression of finite-dimensional 
vectors, that relies on linear dimensionality reduction. Specifically, in CS we do not acquire x directly but rather 
acquire M < N linear measurements y = &x using an M x N CS matrix <£. We refer to y as the measurement 
vector. Ideally, the matrix <E> is designed to reduce the number of measurements M as much as possible while 
allowing for recovery of a wide class of signals x from their measurement vectors y. However, the fact that M < N 
renders the matrix $ rank-dejficient, meaning that it has a nonempty nullspace; this, in turn, implies that for any 
particular signal xq € M N , an infinite number of signals x will yield the same measurements yo = <£xo = &x for 
the chosen CS matrix <E>. 

The motivation behind the design of the matrix <I> is, therefore, to allow for distinct signals x, x' within a class 
of signals of interest to be uniquely identifiable from their measurements y = &x, y' = <&x' , even though M <C N. 
We must therefore make a choice on the class of signals that we aim to recover from CS measurements. 
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A. Sparsity 

Sparsity is the signal structure behind many compression algorithms that employ transform coding, and is the 
most prevalent signal structure used in CS. Sparsity also has a rich history of applications in signal process- 
ing problems in the last century (particularly in imaging), including denoising, deconvolution, restoration, and 
inpainting [21-23]. 

To introduce the notion of sparsity, we rely on a signal representation in a given basis {ipi}f =l for R N . Every 
signal x G R N is representable in terms of N coefficients {9i}f =l as x = YliLi ^fli'i arranging the ipi as columns 
into the N x N matrix ^ and the coefficients Oi into the N x 1 coefficient vector 9, we can write succinctly that 
x = ^9, with 9 G M. N . Similarly, if we use a frame 1 * containing N unit-norm column vectors of length L with 
L < N (i.e., \I/ G R LxN ), then for any vector x G R L there exist infinitely many decompositions 9 G H N such that 
x = ^9. In a general setting, we refer to ^ as the sparsifying dictionary [24]. While our exposition is restricted 
to real-valued signals, the concepts are extendable to complex signals as well [25,26]. 

We say that a signal x is K-sparse in the basis or frame ^ if there exists a vector 9 G M N with only K <C N 
nonzero entries such that x = ^9. We call the set of indices corresponding to the nonzero entries the support of 
9 and denote it by supp(#). We also define the set that contains all signals x that are K-sparse. 

A K-sparse signal can be efficiently compressed by preserving only the values and locations of its nonzero 
coefficients, using 0(K\og 2 N) bits: coding each of the K nonzero coefficient's locations takes log 2 N bits, while 
coding the magnitudes uses a constant amount of bits that depends on the desired precision, and is independent 
of N. This process is known as transform coding, and relies on the existence of a suitable basis or frame ^ that 
renders signals of interest sparse or approximately sparse. 

For signals that are not exactly sparse, the amount of compression depends on the number of coefficients of 
9 that we keep. Consider a signal x whose coefficients 9, when sorted in order of decreasing magnitude, decay 
according to the power law 

\9(l(n))\ <Sn-V r , n=l,...,N, (1) 

where X indexes the sorted coefficients. Thanks to the rapid decay of their coefficients, such signals are well- 
approximated by K-sparse signals. The best K-term approximation error for such a signal obeys 

ay(x, K) := arg min \\x - x'\\ 2 < CSK~ S , (2) 

with s = £ — i and C denoting a constant that does not depend on N [27]. That is, the signal's best approximation 
error (in an l 2 -norm sense) has a power-law decay with exponent s as K increases. We dub such a signal s- 
compressible. When ^ is an orthonormal basis, the best sparse approximation of x is obtained by hard thresholding 
the signal's coefficients, so that only the K coefficients with largest magnitudes are preserved. The situation is 
decidedly more complicated when ^ is a general frame, spurring the development of sparse approximation methods, 
a subject that we will focus on in Section III-C. 

When sparsity is used as the signal structure enabling CS, we aim to recover x from y by exploiting its sparsity. 
In contrast with transform coding, we do not operate on the signal x directly, but rather only have access to the 

'A matrix * is said to be a frame if there exist constants A, B such that v4|| a;|| 2 < ||*^||2 < -BH^Ih for all x. 
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CS measurement vector y. Our goal is to push M as close as possible to K in order to perform as much signal 
"compression" during acquisition as possible. In the sequel, we will assume that \£ is taken to be the identity basis 
so that the signal x = 9 is itself sparse. In certain cases we will explicitly define a different basis or frame ^ that 
arises in a specific application of CS. 

B. Design of CS Matrices 

The main design criteria for the CS matrix <I> is to enable the unique identification of a signal of interest x 
from its measurements y = <3?x. Clearly, when we consider the class of K-sparse signals S^, the number of 
measurements M > K for any matrix design, since the identification problem has K unknowns even when the 
support SI = supp(x) of the signal x is provided. In this case, we simply restrict the matrix to its columns 
corresponding to the indices in Q, denoted by and then use the pseudoinverse to recover the nonzero coefficients 
of x: 

xn = <S>ly- (3) 

Here xn is the restriction of the vector x to the set of indices U, and = (M T M) _1 M T denotes the 
pseudoinverse of the matrix M. The implicit assumption in (3) is that &q has full column-rank so that there 
is a unique solution to the equation y = ^qxq. 

We begin by determining properties of that guarantee that distinct signals x, x' G T>k, x / x', lead to different 
measurement vectors $x ^ In other words, we want each vector y G M. M to be matched to at most one 
vector x G such that y = &x. A key relevant property of the matrix in this context is its spark. 
Definition 1. [28] The spark spark(<I>) of a given matrix $ is the smallest number of columns of <I> that are 
linearly dependent. 

The spark is related to the Kruskal Rank from the tensor product literature; the matrix $ has Kruskal rank 
spark(<£) — 1. This definition allows us to pose the following straightforward guarantee. 

Theorem 2. [28] If spark (<&) > 2K, then for each measurement vector y G M M there exists at most one signal 
x G T,k such that y = <E>x. 

It is easy to see that spark(<!>) G [2, M + 1], so that Theorem 2 yields the requirement M > 2K. 

While Theorem 2 guarantees uniqueness of representation for i^T-sparse signals, computing the spark of a general 
matrix <E> has combinatorial computational complexity, since one must verify that all sets of columns of a certain 
size are linearly independent. Thus, it is preferable to use properties of <E> that are easily computable to provide 
recovery guarantees. The coherence of a matrix is one such property. 

Definition 2. [28-31] The coherence of a matrix $ is the largest absolute inner product between any two 
columns of <E>: 

u m= max (4) 



It can be shown that G 



N -M i 
/-!)> L 



; the lower bound is known as the Welch bound [32,33]. Note 



M(N-l) ■ 

that when N 3> M, the lower bound is approximately /i($) > 1/y/M. One can tie the coherence and spark of a 
matrix by employing the Gershgorin circle theorem. 
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Theorem 3. [34] The eigenvalues of an m x m matrix M with entries Mjj, 1 < i, j < m, lie in the union of 
m discs di = di(ci,ri), 1 < i < m, centered at Ci = M^j with radius ri = Ylj^i 

Applying this theorem on the Gram matrix G = leads to the following result. 

Lemma 1. [28] For any matrix 3>, 

spark($) > 1 + -^. (5) 

By merging Theorem 2 with Lemma 1, we can pose the following condition on <& that guarantees uniqueness. 
Theorem 4. [28,30,31] If 

then for each measurement vector y G R M there exists at most one signal x G Y,k such that y 

Theorem 4, together with the Welch bound, provides an upper bound on the level of sparsity K that guarantees 
uniqueness using coherence: K = O(VM). 

The prior properties of the CS matrix provide guarantees of uniqueness when the measurement vector y is 
obtained without error. Hardware considerations introduce two main sources of inaccuracies in the measurements: 
inaccuracies due to noise at the sensing stage (in the form of additive noise y = &x + n), and inaccuracies due to 
mismatches between the CS matrix used during recovery, <E>, and that implemented during acquisition, <£' = + A 
(in the form of multiplicative noise [35, 36]). Under these sources of error, it is no longer possible to guarantee 
uniqueness; however, it is desirable for the measurement process to be tolerant to both types of error. To be more 
formal, we would like the distance between the measurement vectors for two sparse signals y = <£x, y' = Qx' to 
be proportional to the distance between the original signal vectors x and x'. Such a property allows us to guarantee 
that, for small enough noise, two sparse vectors that are far apart from each other cannot lead to the same (noisy) 
measurement vector. This behavior has been formalized into the restricted isometry property (RIP). 

Definition 3. [4] A matrix $ has the (K, 8)-restricted isometry property ((K, <5)-RIP) if, for all x G S^, 

(1- 5)11*11! < \\$x\\ 2 2 <(l + 5)\\x\\l (7) 

In words, the (K, <5)-RIP ensures that all submatrices of $ of size M x K are close to an isometry, and therefore 
distance-preserving. We will show later that this property suffices to prove that the recovery is stable to presence 
of additive noise n. In certain settings, noise is introduced to the signal x prior to measurement. Recovery is also 
stable for this case; however, there is a degradation in the distortion of the recovery by a factor of N/M [37-40]. 
Furthermore, the RIP also leads to stability with respect to the multiplicative noise introduced by the CS matrix 
mismatch A [35,36]. 

The RIP can be connected to the coherence property by using, once again, the Gershgorin circle theorem 
(Theorem 3). 



(6) 



= ®x. 
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Lemma 2. [41] If <i> has unit-norm columns and coherence fi = //(3>), then <3? has the (K,5)-RIP with 5 < 
{K-l)n. 

One can also easily connect RIP with the spark. For each i\~-sparse vector to be uniquely identifiable by its 
measurements, it suffices for the matrix $ to have the (2K, 5)-RIP with 5 > 0, as this implies that all sets of 
2K columns of $ are linearly independent, i.e., spark(<3?) > 2K (cf. Theorems 2 and 4). We will see later that 
the RIP enables recovery guarantees that are much stronger than those based on spark and coherence. However, 
checking whether a CS matrix satisfies the (K, (5)-RIP has combinatorial computational complexity. 

Now that we have defined relevant properties of a CS matrix <E>, we discuss specific matrix constructions that 
are suitable for CS. AnMxJV Vandermonde matrix V constructed from N distinct scalars has spark(V) = 
M + 1 [27]. Unfortunately, these matrices are poorly conditioned for large values of N, rendering the recovery 
problem numerically unstable. Similarly, there are known matrices <1> of size M x M 2 that achieve the coherence 
lower bound 

M (<&) = 1/VM (8) 

such as the Gabor frame generated from the Alltop sequence [42] and more general equiangular tight frames [33]. 
It is also possible to construct deterministic CS matrices of size M x N that have the (K, <5)-RIP for K = 
0{\[M log Mj \og(N/M)) [43]. These constructions restrict the number of measurements needed to recover a 
K-sparse signal to be M = 0{K 2 log AT), which is undesirable for real-world values of N and K. 

Fortunately, these bottlenecks can be defeated by randomizing the matrix construction. For example, random 
matrices <I> of size M x N whose entries are independent and identically distributed (i.i.d.) with continuous 
distributions have spark(<I>) = M + 1 with high probability. It can also be shown that when the distribution used 
has zero mean and finite variance, then in the asymptotic regime (as M and iV grow) the coherence converges 
to /u(<E>) = 2-y/log N/M [44,45]. Similarly, random matrices from Gaussian, Rademacher, or more generally a 
subgaussian distribution 2 have the (K, <5)-RIP with high probability if [46] 

M = 0(K\og(N/K)/5 2 ). (9) 

Finally, we point out that while the set of RIP-fulfilling matrices provided above might seem limited, emerging 
numerical results have shown that a variety of classes of matrices <I> are suitable for CS recovery at regimes similar 
to those of the matrices advocated in this section, including subsampled Fourier and Hadamard transforms [47, 
48]. 

C. CS Recovery Algorithms 

We now focus on solving the CS recovery problem: given y and <3?, find a signal x within the class of interest 
such that y = &x exactly or approximately. 

2 A Rademacher distribution gives probability 1/2 to the values ±1. A random variable X is called subgaussian if there exists c > 
such that E (e xt ) < e c * ^ 2 for all (el. Examples include the Gaussian, Bernoulli, and Rademacher random variables, as well as any 
bounded random variable. 
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When we consider sparse signals, the CS recovery process consists of a search for the sparsest signal x that 
yields the measurements y. By defining the £q "norm" of a vector ||x||o as the number of nonzero entries in x, 
the simplest way to pose a recovery algorithm is using the optimization 

x = arg min ||x||o subject to y = &x. (10) 

x£R N 

Solving (10) relies on an exhaustive search and is successful for all x G T,k when the matrix has the sparse 
solution uniqueness property (i.e., for M as small as 2K, see Theorems 2 and 4). However, this algorithm has 
combinatorial computational complexity, since we must check whether the measurement vector y belongs to the 
span of each set of K columns of $, K = 1,2, ... ,N. Our goal, therefore, is to find computationally feasible 
algorithms that can successfully recover a sparse vector x from the measurement vector y for the smallest possible 
number of measurements M. 

An alternative to the £q "norm" used in (10) is to use the l\ norm, defined as ||x||i = Yln=i \ x ( n )\- The resulting 
adaptation of (10), known as basis pursuit (BP) [22], is formally defined as 

x = arg min ||x||i subject to y = <3?x. (11) 

xeR N 

Since the t\ norm is convex, (11) can be seen as a convex relaxation of (10). Thanks to the convexity, this 
algorithm can be implemented as a linear program, making its computational complexity polynomial in the signal 
length [49]. 3 

The optimization (11) can be modified to allow for noise in the measurements y = <3?x + n; we simply change 
the constraint on the solution to 

x = arg min llxlli subject to llw — ^xlh < e, (12) 

xt=VL N 

where e > ||n||2 is an appropriately chosen bound on the noise magnitude. This modified optimization is known as 
basis pursuit with inequality constraints (BPIC) and is a quadratic program with polynomial complexity solvers [49]. 
The Lagrangian relaxation of this quadratic program is written as 

x = arg min \\x\\i + Ally — ^xlU, (13) 

and is known as basis pursuit denoising (BPDN). There exist many efficient solvers to find BP, BPIC, and BPDN 
solutions; for an overview, see [51]. 

Oftentimes, a bounded-norm noise model is overly pessimistic, and it may be reasonable instead to assume that 
the noise is random. For example, additive white Gaussian noise n ~ Af(0, a 2 T) is a common choice. Approaches 
designed to address stochastic noise include complexity-based regularizaton [52] and Bayesian estimation [53]. 
These methods pose probabilistic or complexity-based priors, respectively, on the set of observable signals. 
The particular prior is then leveraged together with the noise probability distribution during signal recovery. 
Optimization-based approaches can also be formulated in this case; one of the most popular techniques is the 
Dantzig selector [54]: 

x = arg min ||x||i s. t. ||3> T (y — $a;)||oo < Av/k>g Na, (14) 



3 A similar set of recovery algorithms, known as total variation minimizers, operate on the gradient of an image, which exhibits sparsity 
for piecewise smooth images [50]. 
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Algorithm 1 Orthogonal Matching Pursuit 

Input: CS matrix <!>, measurement vector y 
Output: Sparse representation x 
Initialize: xq = 0, r = y, Cl = 0, i = 
while halting criterion false do 

i <— i + 1 

b <— & T r {form residual signal estimate} 

$7 ^— U supp(T(6, 1)) {update support with residual} 

x~i\n «— &qV, Xi\n c ^—0 {update signal estimate} 

r «— y — §Xi {update measurement residual} 
end while 
return x <— x ; 



where || • \ O0 denotes the ^-norm, which provides the largest-magnitude entry in a vector and A is a constant 
parameter that controls the probability of successful recovery. 

An alternative to optimization-based approaches, are greedy algorithms for sparse signal recovery. These methods 
are iterative in nature and select columns of <I> according to their correlation with the measurements y determined 
by an appropriate inner product. For example, the matching pursuit and orthogonal matching pursuit algorithms 
(OMP) [24, 55] proceed by finding the column of <3? most correlated to the signal residual, which is obtained by 
subtracting the contribution of a partial estimate of the signal from y. The OMP method is formally defined as 
Algorithm 1, where T(x,K) denotes a thresholding operator on x that sets all but the K entries of x with the 
largest magnitudes to zero, and x|n denotes the restriction of x to the entries indexed by fl. The convergence 
criterion used to find sparse representations consists of checking whether y = &x exactly or approximately; note 
that due to its design, the algorithm cannot run for more than M iterations, as $ has M rows. Other greedy 
techniques that have a similar flavor to OMP include CoSaMP [56], detailed as Algorithm 2, and Subspace Pursuit 
(SP) [57]. A simpler variant is known as iterative hard thresholding (IHT) [58]: starting from an initial signal 
estimate xq = 0, the algorithm iterates a gradient descent step followed by hard thresholding, i.e., 

Xi = T{x i - 1 + $ T (y-<!>x i - 1 ),K), (15) 

until a convergence criterion is met. 

D. CS Recovery Guarantees 

Many of the CS recovery algorithms above come with guarantees on their performance. We group these results 
according to the matrix metric used to obtain the guarantee. 

First, we review reuslts that rely on coherence. As a first example, BP and OMP recover a K-sparse vector from 
noiseless measurements when the matrix $ satisfies (6) [28,30,31]. There also exist coherence-based guarantees 
designed for meausrements corrupted with arbitrary noise. 
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Algorithm 2 CoSaMP 

Input: CS matrix <!>, measurement vector y, sparsity K 
Output: If -sparse approximation x to true signal x 
Initialize: xq = 0, r = y, i = 
while halting criterion false do 
i <— i + 1 

e <— <£ T r {form residual signal estimate} 

ft «— supp(T(e, 2K)) {prune residual} 

T «— SI U supp(xj-i) {merge supports} 

b\r <- 3>y2A fr|T c ^—0 {form signal estimate} 

Xi «— T(6, if) {prune signal using model} 

r «— y — Qxi {update measurement residual} 
end while 
return x <— x ; 



Theorem 5. [59] Let the signal x £ and write y = &x + n. Denote 7 = ||n||2. Suppose that K < 
(1//jl($>) + l)/4 a«<i e > 7 in (72). TTieTi f/ie output x of (12) has error bounded by 

t + e 

s-g 2 < , (16) 
" Vl-M(f)(4if-1) 

wMe f/ie output x of the OMP algorithm with halting criterion \\r\\2 < 7 has error bounded by 

\\x-x\\ 2 < , 7 , (17) 

" y/l-»(<S>)(K-l) 

provided that 7 < A(l — (i(<&)(2K — l))/2 for OMP, with A being a positive lower bound on the magnitude of 
the nonzero entries of x. 

Note here that BPIC must be aware of the noise magnitude 7 to make e > 7, while OMP must be aware of the 
noise magnitude 7 to set an appropriate convergence criterion. Additionally, the error in Theorem 5 is proportional 
to the noise magnitude 7. This is because the only assumption on the noise is its magnitude, so that n might be 
aligned to maximally harm the estimation process. 

In the random noise case, bounds on ||x — x\\2 can only be stated in high probability, since there is always 
a small probability that the noise will be very large and completely overpower the signal. For example, under 
additive white Gaussia n noise (AWG N), the guarantees for BPIC in Theorems 5 hold with high probability when 
the parameter e = a\J M + 77V2M, with i] denoting an adjustable parameter to control the probability of ||n||2 
being too large [60]. A second example gives a related result for the BPDN algorithm. 



Theorem 6. [61] Let the signal x € and write y = <E>x + n, where n ~ A/"(0, a 2 I). Suppose that K < l/3/i(3>) 
and consider the BPDN optimization problem (13) with A = \J 16a 2 log M. Then, with probability on the order 
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of 1 — 1/M 2 , the solution x of (13) is unique, its error is bounded by 

\\x-x\\ 2 < Co-tJK log M, (18) 
and its support is a subset of the true K -element support of x. 

Under AWGN, the value of e one would need to choose in Theorem 5 is 0(a\^M), giving a bound much larger 
than Theorem 6, which is 0(o\JK logM). This demonstrates the noise reduction achievable due to the adoption 
of the random noise model. These guarantees come close to the Cramer-Rao bound, which is given by o\[K [62]. 
We finish the study of coherence-based guarantees for the AWGN setting with a result for OMR 
Theorem 7. [61] Let the signal x G and write y = &x + n, where n ~ M(0,a 2 I). Suppose that K < 

+ !)/ 4 and 

min \x{n)\ > ~ V .„V — -,\ ^ ( 19 ) 
i<n<Af' v Jl ~ 1 - (2K - l)/i($) 

for some constant a > 0. Then, with probability at least 1 — (N a ^ir(\ + a) \ogN)~ l , the output x of OMP after 
K iterations has error bounded by 

\\x-x\\ 2 < Cay/(1 + a)KlogN, (20) 
and its support matches the true K-element support of x. 

The greedy nature of OMP poses the requirement on the minimum absolute-valued entry of x in order for the 
support to be correctly detected, in contrast to BPIC and BPDN. 

A second class of guarantees are based on the RIP. The following result for OMP provides an interesting 
viewpoint of greedy algorithms. 

Theorem 8. [63] Let the signal x G and write y = <&x. Suppose that $ has the (K + 1, 5)-RIP with 5 < -7^^- 
Then OMP can recover a K-sparse signal x exactly in K iterations. 

Guarantees also exist for noisy measurement settings, albeit with significantly more stringent RIP conditions on 
the CS matrix. 

Theorem 9. [64] Let the signal x G and write y = &x + n. Suppose that $ has the (31K, 5)-RIP with 
5 < 1/3. Then the output of OMP after 30K iterations has error bounded by 

\\x - x\\ 2 < C\\n\\ 2 . (21) 

The next result extends guarantees from sparse to more general signals measured under noise. We collect a set 
of independent statements in a single theorem. 

Theorem 10. [4, 56-58] Let the signal x £ and write y = &x + n. The outputs x of the CoSaMP, SP, 1HT, 
and BPIC algorithms, with $ having the (cK, 5)-RIP, obey 

\\x - x\\ 2 < C\\\x - xxh + C 2 — ]=\\x - xk\\i + C 3 \\n\\ 2 , (22) 

v K 
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where xk = argmnv e £ K \\x — x'\\2 is the best K-sparse approximation of the vector x when measured in the £2 
norm. The requirements on the parameters c, 5 of the RIP and the values of C\, C2, and C3 are specific to each 
algorithm. For example, for the BPIC algorithm, c = 2 and 5 = y/2 — 1 suffice to obtain the guarantee (22). 

The type of guarantee given in Theorem 10 is known as uniform instance optimality, in the sense that the CS 
recovery error is proportional to that of the best K-sparse approximation to the signal x for any signal x G M. N . In 
fact, the formulation of the CoSaMP, SP and IHT algorithms was driven by the goal of instance optimality, which 
has not been shown for older greedy algorithms like MP and OMP Theorem 10 can also be adapted to recovery 
of exactly sparse signals from noiseless measurements. 

Corollary 11. Let the signal x G Y<k and write y = $x. The CoSaMP, SP, IHT, and BP algorithms can exactly 
recover x from y if Q has the (cK,5)-RIP, where the parameters c, 5 of the RIP are specific to each algorithm. 

Similarly to Theorem 5, the error in Theorem 10 is proportional to the noise magnitude ||ra||2, and the bounds 
can be tailored to random noise with high probability. The Dantzig selector improves the scaling of the error in 
the AWGN setting. 

Theorem 12. [54] Let the signal x G and write y = <frx + n, where n ~ M(0,a 2 I). Suppose that A = 
\/2(l + l/i) in (14) and that $ has the (2K, 82K) and {3K, 53k)-RIPs with S2K + S3K < 1- Then, with probability 
at least 1 — N l /yjit logiV, we have 

\\x-x\\2 < C(\ + l/t) 2 Ka 2 \ogN. (23) 

Similar results under AWGN have been shown for the OMP and thresholding algorithms [61]. 

A third class of guarantees relies on metrics additional to coherence and RIP. This class has a non-uniform flavor 
in the sense that the results apply only for a certain subset of sufficiently sparse signals. Such flexibility allows for 
significant relaxations on the properties required from the matrix <E>. The next example has a probabilistic flavor 
and relies on the coherence property. 

Theorem 13. [65] Let the signal x G with support drawn uniformly at random from the available (^) 
possibilities and entries drawn independently from a distribution P(X) so that P(X > 0) = P(X < 0). Write 
y = <&x and fix s > 1 and < S < 1/2. If K < 'gffff and 

^18 log £ log(f + + 2 f WI < (l - 2-V) , 

then x is the unique solution to BP (11) with probability at least 1 — 2(5 — (K/2)~ s . 

In words, the theorem says that as long as the coherence and the spectral norm ||3>||2 of the CS matrix 

are small enough, we will be able to recover the majority of K-sparse signals x from their measurements y. 
Probabilistic results that rely on coherence can also be obtained for the BPDN algorithm (13) [45]. 

The main difference between the guarantees that rely solely on coherence and those that rely on the RIP and 
probabilistic sparse signal models is the scaling of the number of measurements M needed for successful recovery 
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of if-sparse signals. According to the bounds (8) and (9), the sparsity level that allows for recovery with high 
probability in Theorems 10, 12, and 13 is K = O(M), compared with K = C(a/M) for the deterministic 
guarantees provided by Theorems 5, 6, and 7. This so-called square root bottleneck [65] gives an additional reason 
for the popularity of randomized CS matrices and sparse signal models. 

IV. Structure in CS Matrices 

While most initial work in CS has emphasized the use of randomized CS matrices whose entries are obtained 
independently from a standard probability distribution, such matrices are often not feasible for real-world applica- 
tions due to the cost of multiplying arbitrary matrices with signal vectors of high dimension. In fact, very often the 
physics of the sensing modality and the capabilities of sensing devices limit the types of CS matrices that can be 
implemented in a specific application. Furthermore, in the context of analog sampling, one of the prime motivations 
for CS is to build analog samplers that lead to sub-Nyquist sampling rates. These involve actual hardware and 
therefore structured sensing devices. Hardware considerations require more elaborate signal models to reduce the 
number of measurements needed for recovery as much as possible. In this section, we review available alternatives 
for structured CS matrices; in each case, we provide known performance guarantees, as well as application areas 
where the structure arises. In Section VI we extend the CS framework to allow for analog sampling, and introduce 
further structure into the measurement process. This results in new hardware implementations for reduced rate 
samplers based on extended CS principles. Note that the survey of CS devices given in this section is by no 
means exhaustive [66-68]; our focus is on CS matrices that have been investigated from both a theoretical and an 
implementational point of view. 

A. Subsampled Incoherent Bases 

The key concept of a frame's coherence can be extended to pairs of orthonormal bases. This enables a new 
choice for CS matrices: one simply selects an orthonormal basis that is incoherent with the sparsity basis, and 
obtains CS measurements by selecting a subset of the coefficients of the signal in the chosen basis [69]. We note 
that some degree of randomness remains in this scheme, due to the choice of coefficients selected to represent the 
signal as CS measurements. 

1) Formulation: Formally, we assume that a basis <E> G R NxN is provided for measurement purposes, where 
each column of $ = [fa fa ... 4>n] corresponds to a different basis element. Let $ be an JV x M column 

— T 

submatrix of $ that preserves the basis vectors with indices Y and set y = x. Under this setup, a different 
metric arises to evaluate the performance of CS. 

Definition 4. [28, 69] The mutual coherence of the iV-dimensional orthonormal bases <E> and ^ is the maximum 
absolute value of the inner product between elements of the two bases: 

max \{faAj)\, (24) 

1<i,j<N 

where ipj denotes the j th column, or element, of the basis ^. 

The mutual coherence / u($,\E') has values in the range [iV -1 / 2 ,!]. For example, VP) = N~ 1 / 2 when $ 
is the discrete Fourier transform basis, or Fourier matrix, and ^ is the canonical basis, or identity matrix, and 
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viz) = 1 when both bases share at least one element or column. Note also that the concepts of coherence and 
mutual coherence are connected by the equality /i(<J>, ^) = /i([<I> \&]). The definition of mutual coherence can also 
be extended to infinite-dimensional representations of continuous-time (analog) signals [33,70]. 

2) Theoretical guarantees: The following theorem provides a recovery guarantee based on mutual coherence. 
Theorem 14. [69] Let x = ^9 be a K -sparse signal in ^ with support U C {1, . . . , N}, \Q\ = K, and with 
entries having signs chosen uniformly at random. Choose a subset T C {1, . . . , N} uniformly at random for the set 
of observed measurements, with M = \T\. Suppose that M > CKNp 2 (^,^f)\og(N/S) and M > C'\og 2 (N/5) 
for fixed values of 5 < 1,C, C. Then with probability at least 1 — 5, is the solution to (11). 

The number of measurements required by Theorem 14 ranges from 0(K log N) to O(N). It is possible to 
expand the guarantee of Theorem 14 to compressible signals by adapting an argument of Rudelson and Vershynin 
in [71] to link coherence and restricted isometry constants. 

Theorem 15. [71, Remark 3.5.3] Choose a subset T C {1, . . . , N} for the set of observed measurements, with 
M = |T|. Suppose that 

M > CKVNtn($, *) log(tK log N) log 2 K (25) 
for a fixed value of C. Then with probability at least 1 — 5e - ' the matrix <& T ^> has the RIP with constant 52K < 1/2. 

Using this theorem, we obtain the guarantee of Theorem 10 for compressible signals with the number of 
measurements M dictated by the coherence value p,(&, 

3) Applications: There are two main categories of applications where subsampled incoherent bases are used. In 
the first category, the acquisition hardware is limited by construction to measure directly in a transform domain. 
The most relevant examples are magnetic resonance imaging (MRI) [72] and tomographic imaging [73], as well 
as optical microscopy [74,75]; in all of these cases, the measurements obtained from the hardware correspond 
to coefficients of the image's 2-D continuous Fourier transform, albeit not typically selected in a randomized 
fashion. Since the Fourier functions, corresponding to sinusoids, will be incoherent with functions that have 
localized support, this imaging approach works well in practice for sparsity/compressibility transforms such as 
wavelets [69], total variation [73], and the standard canonical representation [74]. 

In the case of optical microscopy, the Fourier coefficients that are measured correspond to the lowpass regime. 
The highpass values are completely lost. When the underlying signal x can change sign, standard sparse recovery 
algorithms such as BP do not typically succeed in recovering the true underlying vector. To treat the case of 
recovery from lowpass coefficients, a special purpose sparse recovery method was developed under the name of 
Nonlocal Hard Thresholding (NLHT) [74]. This technique attempts to allocate the off-support of the sparse signal 
in an iterative fashion by performing a thresholding step that depends on the values of the neighboring locations. 

The second category involves the design of new acquisition hardware that can obtain projections of the signal 
against a class of vectors. The goal of the matrix design step is to find a basis whose elements belong to the 
class of vectors that can be implemented on the hardware. For example, a class of single pixel imagers based 
on optical modulators [60,76] (see an example in Fig. 1) can obtain projections of an image against vectors that 
have binary entries. Example bases that meet this criterion include the Walsh-Hadamard and noiselet bases [77]. 
The latter is particularly interesting for imaging applications, as it is known to be maximally incoherent with the 




Fig. 1 . Diagram of the single pixel camera. The incident lightfield (corresponding to the desired image x ) is reflected off a digital micro-mirror 
device (DMD) array whose mirror orientations are modulated in the pseudorandom pattern supplied by the random number generator (RNG). 
Each different mirror pattern produces a voltage at the single photodiode that corresponds to one measurement y{m). The process is repeated 
with different patterns M times to obtain a full measurement vector y (taken from [76]). 

Haar wavelet basis. In contrast, certain elements of the Walsh-Hadamard basis are highly coherent with wavelet 
functions at coarse scales, due to their large supports. Permuting the entries of the basis vectors (in a random or 
pseudorandom fashion) helps reduce the coherence between the measurement basis and a wavelet basis. Because 
the single pixel camera modulates the light field through optical aggregation, it improves the signal-to-noise ratio 
of the measurements obtained when compared to standard multisensor schemes [78]. Similar imaging hardware 
architectures have been developed in [79,80]. 

An additional example of a configurable acquisition device is the random sampling ADC [81], which is designed 
for acquisition of periodic, multitone analog signals whose frequency components belong to a uniform grid 
(similarly to the random demodulator of Section IV-B3). The sparsity basis for the signal once again is chosen 
to be the discrete Fourier transform basis. The measurement basis is chosen to be the identity basis, so that the 
measurements correspond to standard signal samples taken at randomized times. The hardware implementation 
employs a randomized clock that drives a traditional low-rate ADC to sample the input signal at specific times. 
As shown in Fig. 2, the random clock is implemented using an FPGA that outputs a predetermined pseudorandom 
pulse sequence indicating the times at which the signal is sampled. The patterns are timed according to a set of 
pseudorandom arithmetic progressions. This process is repeated cyclically, establishing a windowed acquisition 
for signals of arbitrary length. The measurement and sparsity framework used by this randomized ADC is also 
compatible with sketching algorithms designed for signals that are sparse in the frequency domain [81,82]. 

B. Structurally Subsampled Matrices 

In certain applications, the measurements obtained by the acquisition hardware do not directly correspond to the 
sensed signal's coefficients in a particular transform. Rather, the observations are a linear combination of multiple 
coefficients of the signal. The resulting CS matrix has been termed a structurally subsampled matrix [83]. 

1) Formulation: Consider a matrix of available measurement vectors that can be described as the product 
$ = RU, where R is a P x N mixing matrix and U is a basis. The CS matrix <I> is obtained by selecting M out 
of P rows at random, and normalizing the columns of the resulting subsampled matrix. There are two possible 
downsampling stages: first, R might offer only P < N mixtures to be available as measurements; second, we only 
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Fig. 2. Diagram of the random sampling ADC. A pseudorandom clock generated by an FPGA drives a low-rate standard ADC to sample the 
input signal at predetermined pseudorandom times. The samples obtained are then fed into a CS recovery algorithm or a sketching algorithm 
for Fourier sparse signals to estimate the frequencies and amplitudes of the tones present in the signal (taken from [81]). 

preserve M < P of the mixtures available to represent the signal. This formulation includes the use of subsampled 
incoherent bases simply by letting P = N and R = I, i.e., no coefficient mixing is performed. 

2) Theoretical guarantees: To provide theoretical guarantees we place some additional constraints on the mixing 
matrix R. 

Definition 5. The (P, N) integrator matrix S, for P a divisor of N, is defined as S = [sj s% ... Sp] T , where 
the p th row of S is defined as s p = [0 lx ( p _ 1 ) i 1i x l Oix(Ar-pL)]> 1 < P < P> with L = P/N. 

In words, using S as a mixing matrix sums together intervals of L adjacent coefficients of the signal under the 
transform U. We also use a diagonal modulation matrix M whose nonzero entries are independently drawn from 
a Rademacher distribution, and formulate our mixing matrix as R = SM. 

Theorem 16. [83, Theorem 1] Let & be a structurally subsampled matrix of size M x N obtained from the basis 
U and the P x N mixing matrix R = SM via randomized subsampling. Then for each integer K > 2, any z > 1 
and any 6 £ (0, 1), there exist absolute positive constants c\, C2 such that if 

M > Cl zKNii 2 (V, log 3 TV log 2 K, (26) 

then the matrix $ has the (K,5)-RIP with probability at least 1 — 20max{exp(— C2t) 2 z), iV -1 }. 

Similarly to the subsampled incoherent bases case, the possible values of /u(U, ^) provide us with a required 
number of measurements M ranging from 0{K log 3 N) to O(N). 

3) Applications: Compressive ADCs are one promising application of CS, which we discuss in detail in 
Section VI-B6 after introducing the infinite-dimensional CS framework. A first step in this direction is the 
architecture known as the random demodulator (RD) [84]. The RD employs structurally subsampled matrices 
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Fig. 3. Block diagram of the random demodulator (taken from [84]). 



for the acquisition of periodic, multitone analog signals whose frequency components belong in a uniform grid. 
Such signals have a finite parametrization and therefore fit the finite-dimensional CS setting. 

To be more specific, our aim is to acquire a discrete uniform sampling x G 1^ of a continuous-time signal 
f(t) observed during an acquisition interval of 1 sec where it is assumed that f(t) is of the form 

N 

f{t) = Y J Xue^ kt . (27) 
fc=i 

The vector x is sparse so that only K of its values are non-zero. As shown in Fig. 3, the sampling hardware 
consists of a mixer element that multiplies the signal f(t) with a chipping sequence p c (t) at a rate N chirps/second. 
This chipping sequence is the output of a pseudorandom number generator. The combination of these two units 
effectively performs the product of the Nyquist-sampled discrete representation of the signal x with the matrix 
M in the analog domain. The output from the modulator is sampled by an "integrate-and-dump" sampler - a 
combination of an accumulator unit and a uniform sampler synchronized with each other - at a rate of M samples 
per interval. Such sampling effectively performs a multiplication of the output of the modulator by the (M, N) 
integrator matrix S. To complete the setup, we set U = I and ^ to be the discrete Fourier transform basis; it is 
easy to see that these two bases are maximally incoherent. In the RD architecture, all subsampling is performed 
at this stage; there is no randomized subsampling of the output of the integrator. 

A prototype implementation of this architecture is reported in [85]; similar designs are proposed in [5, 86]. Note 
that the number of columns of the resulting CS matrix scales with the maximal frequency in the representation 
of f(t). Therefore, in practice, this maximal frequency cannot be very large. For example, the implementations 
reported above reach maximum frequency rates of 1MHz; the corresponding signal representation therefore has 
one million coefficients. The matrix multiplications can be implemented using algorithmic tools such as the Fast 
Fourier Transform (FFT). In conjunction with certain optimization solvers and greedy algorithms, this approach 
significantly reduces the complexity of signal recovery. While the original formulation sets the sparsity basis * 
to be the Fourier basis, limiting the set of recoverable signals to periodic multitone signals, we can move beyond 
structurally subsampled matrices by using redundant Fourier domain frames for sparsity. These frames, together 
with modified recovery algorithms, can enable acquisition of a wider class of frequency-sparse signals [87,88]. 

While the RD is capable of implementing CS acquisition for a specific class of analog signals having a finite 
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parametrization, the CS framework can be adapted to infinite-dimensional signal models, enabling more efficient 
analog acquisition and digital processing architectures. We defer the details to Section VI. 

C. Subsampled Circulant Matrices 

The use of Toeplitz and circulant structures [89-91] as CS matrices was first inspired by applications in 
communications - including channel estimation and multi-user detection - where a sparse prior is placed on 
the signal to be estimated, such as a channel response or a multiuser activity pattern. When compared with generic 
CS matrices, subsampled circulant matrices have a significantly smaller number of degrees of freedom due to the 
repetition of the matrix entries along the rows and columns. 

1 ) Formulation: A circulant matrix U is a square matrix where the entries in each diagonal are all equal, and 
where the first entry of the second and subsequent rows is equal to the last entry of the previous row. Since 
this matrix is square, we perform random subsampling of the rows (in a fashion similar to that described in 
Section IV-B) to obtain a CS matrix = RU, where R is an M x N subsampling matrix, i.e., a submatrix of 
the identity matrix. We dub $ a subsampled circulant matrix. 

2) Theoretical guarantees: Even when the sequence defining U is drawn at random from the distributions 
described in Section III, the particular structure of the subsampled circulant matrix = RU prevents the use of 
the proof techniques used in standard CS, which require all entries of the matrix to be independent. However, it 
is possible to employ different probabilistic tools to provide guarantees for subsampled circulant matrices. The 
results still require randomness in the selection of the entries of the circulant matrix. 

Theorem 17. [91] Let <I> be a subsampled circulant matrix whose distinct entries are independent random 
variables following a Rademacher distribution, and R is an arbitrary M x TV identity submatrix. Furthermore, let 
5 be the smallest value for which (7) holds for all x € E/r- Then for 5o € (0, 1) we have E[<5] < 5o provided that 

M>C maxj^'K 3 / 2 log 3/2 N, 5q 2 K log 2 N log 2 K}, 

where C > is a universal constant. Furthermore, for < A < 1, 

P(5 K > E[5] + A) < e^l°\ where a 1 = C' log 2 K log N, 

for a universal constant C > 0. 

In words, if we have M = O (if 1-5 log 15 TV"), then the RIP required for by many algorithms for successful 
recovery is achieved with high probability. 

3) Applications: There are several sensing applications where the signal to be acquired is convolved with 
the sampling hardware's impulse response before it is measured. Additionally, because convolution is equivalent 
to a product operator in the Fourier domain, it is possible to speed up the CS recovery process by performing 
multiplications by the matrices 3> and <£ T via the FFT. In fact, such an FFT-based procedure can also be exploited to 
generate good CS matrices [92]. First, we design a matrix in the Fourier domain to be diagonal and have entries 
drawn from a suitable probability distribution. Then, we obtain the measurement matrix $ by subsampling the 
matrix <£, similarly to the incoherent basis case. While this formulation assumes that the convolution during signal 
acquisition is circulant, this gap between theory and practice has been studied and shown to be controllable [93]. 
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(a) (b) 

Fig. 4. Compressive imaging via coded aperture, (a) Example of an ideal point spread function for a camera with a pinhole aperture. The size 
of the support of the function is dependent on the amount of blur caused by the imager, (b) Coded aperture used for compressive imaging (taken 
from [94]). 



Our first example concerns imaging architectures. The impulse response of the imaging hardware is known as the 
point spread function (PSF), and it represents imaging artifacts such as blurring, distortion, and other aberrations; 
an example is shown in Fig. 4(a). It is possible then to design a compressive imaging architecture by employing 
an imaging device that has a dense PSF: an impulse response having a support that includes all of the imaging 
field. This dense PSF is coupled with a downsampling step in the pixel domain to achieve compressive data 
acquisition [93,94]. This is in contrast to the random sampling advocated by Theorem 17. A popular way to 
achieve such a dense PSF is by employing so-called coded apertures, which change the pinhole aperture used for 
image formation in most cameras to a more complex design. Figure 4(b) shows an example coded aperture that 
has been used successfully in compressive imaging [93,94]. 

Our second example uses special-purpose light sensor arrays that perform the multiplication with a Toeplitz 
matrix using a custom microelectronic architecture [95], which is shown in Fig. 5. In this architecture, an 
N x N pixel light sensor array is coupled with a linear feedback shift register (LFSR) of length N 2 whose 
input is connected to a pseudorandom number generator. The bits in the LFSR control multipliers that are 
tied to the outputs from the ./V 2 light sensors, thus performing additions and subtractions according to the 
pseudorandom pattern generated. The weighted outputs are summed in two stages: column-wise sums are obtained 
at an operational amplifier before quantization, whose outputs are then added together in an accumulator. In this 
way, the microelectronic architecture calculates the inner product of the image's pixel values with the sequence 
contained in the LFSR. The output of the LFSR is fed back to its input after the register is full, so that subsequent 
measurements correspond to inner products with shifts of previous sequences. The output of the accumulator is 
sampled in a pseudorandom fashion, thus performing the subsampling required for CS. This results in an effective 
subsampled circulant CS matrix. 

D. Separable Matrices 

Separable matrices [96,97] provide computationally efficient alternatives to measure very large signals, such 
as hypercube samplings from multidimensional data. These matrices have a succinct mathematical formulation as 
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Fig. 5. Schematic for the microelectronic architecture of the convolution-based compressive imager. The acquisition device effectively 
implements a subsampled circulant CS matrix. Here, Qa denotes the quantization operation with resolution A (taken from [95]). 



Kronecker products, and feature significant structure present as correlations among matrix subblocks. 

1) Formulation: Kronecker product bases are well suited for CS applications concerning multidimensional 
signals, that is, signals that capture information from an event that spans multiple dimensions, such as spatial, 
temporal, spectral, etc. These bases can be used both to obtain sparse or compressible representations or as CS 
measurement matrices. 

The Kronecker product of two matrices A and B of sizes P x Q and R x S, respectively, is defined as 

A(1,1)B A(1,2)B ... A(1,Q)B 
A(2,1)B A(2,2)B ... A(2,Q)B 



A®B :-- 



A(P,1)B A{P,2)B 



A(P,Q)B 



Thus, A® B is a matrix of size PR x QS. Let \&i and ^2 be bases for l^ 1 and M JV2 , respectively. Then one can 



>N 2 



find a basis for R Nl ® 1* = R N ^ N ^ as * 
the Kronecker product of D matrices: 



Vl/x ® We focus now on CS matrices that can be expressed as 



$ = ^ 1 (g) $ 2 



(28) 
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If we denote the size of $ d as M d x N d , then the matrix $ has size MxN, with M = f\ d=1 M d and N = H d=1 N d . 

Next, we describe the use of Kronecker product matrices in CS of multidimensional signals. We denote the 
entries of a D-dimensional signal x by x(ni, . . . , n d ). We call the restriction of a multidimensional signal to fixed 
indices for all but its d th dimension a d-section of the signal. For example, for a 3-D signal x € M. NlXN2xNs , the 
vector Xij : . := [x(i,j, 1) x(i,j, 2) . . . x(i, j, N%)] is a 3-section of x. 

Kronecker product sparsity bases make it possible to simultaneously exploit the sparsity properties of a multidi- 
mensional signal along each of its dimensions. The new basis is simply the Kronecker product of the bases used 
for each of its d-sections. More formally, we let x £ M Nl ® . . . ® M N ° = M JViX - xIVd = Rnti* anc j assume 
that each d-section is sparse or compressible in a basis ^ d . We then define a sparsity/compressibility basis for x 
as^ = *i(8)...(8)*£), and obtain a coefficient vector 9 for the signal ensemble so that x = VP 9, where x is a 
column vector-reshaped representation of x. We then have y = <&x = <3? \I> 9. 

2) Theoretical guarantees: Due to the similarity between blocks of Kronecker product CS matrices, it is easy 
to obtain bounds for their performance metrics. Our first result concerns the RIP. 

Lemma 3. [96, 97] Let <& d , 1 < d < D, be matrices that have the (K,S d )-RIP, 1 < d < D, respectively. Then 
defined in (28), has the (K,S)-RIP, with 

D 

S<l[(l + S d )-1. (29) 

d=l 

When & d is an orthonormal basis, it has the (K, c^-RIP with 5 d = for all K < N. Therefore the RIP constant 
of the Kronecker product of an orthonormal basis and a CS matrix is equal to the RIP constant of the CS matrix. 

It is also possible to obtain results on mutual coherence (described in Section III) for cases in which the basis 
used for sparsity or compressibility can also be expressed as a Kronecker product. 
Lemma 4. [96, 97] Let <$> d , ^ d be bases for R Nd for d = 1, . . . , D. Then 

D 

/ i (*,*)=HM*d,*d)- (30) 

d=l 

Lemma 4 provides a conservation of mutual coherence across Kronecker products. Since the mutual coherence 
of each <i-section's sparsity and measurement bases is upper bounded by one, the number of Kronecker product- 
based measurements necessary for successful recovery of the multidimensional signal (following Theorems 14 
and 15) is always lower than or equal to the corresponding number of necessary partitioned measurements that 
process only a portion of the multidimensional signal along its d th dimension at a time, for some d 6 {1, . . . , D}. 
This reduction is maximized when the d-section measurement basis $ is maximally incoherent with the ti-section 
sparsity basis ^. 

3) Applications: Most applications of separable CS matrices involve multidimensional signals such as video 
sequences and hyperspectral datacubes. Our first example is an extension of the single pixel camera (see Fig. 1) 
to hyperspectral imaging [98]. The aim here is to record the reflectivity of a scene at different wavelengths; each 
wavelength has a corresponding spectral frame, so that the hyperspectral datacube can be essentially interpreted as 
the stacking of a sequence of images corresponding to the different spectra. The single pixel hyperspectral camera 
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Fig. 6. Schematic for the compressive analog CMOS imager. The acquisition device effectively implements a separable CS matrix (taken 
from [99]). 



is obtained simply by replacing the single photodiode element by a spectrometer, which records the intensity 
of the light reaching the sensor for a variety of different wavelenghts. Because the digital micromirror array 
reflects all the wavelengths of interest, the spectrometer records measurements that are each dependent only on a 
single spectral band. Since the same patterns are acting as a modulation sequence to all the spectral bands in the 
datacube, the resulting measurement matrix is expressible as $ = I <g> <£, where $ is the matrix that contains the 
patterns programmed into the mirrors. Furthermore, it is possible to compress a hyperspectral datacube by using a 
hyperbolic wavelet basis, which itself is obtained as the Kronecker product of one-dimensional wavelet bases [96]. 

Our second example concerns the transform imager [99], an imaging hardware architecture that implements a 
separable CS matrix. A sketch of the transform imager is shown in Fig. 6. The image x is partitioned into blocks 
P a of size 16 x 16 to form a set of tiles; here a indexes the block locations. The transform imager is designed 
to perform the multiplication y a = A T P a B, where A and B are fixed matrices. This product of three matrices 
can be equivalently rewritten as y a = (B ® A) T P a , where P a denotes a column vector reshaping of the block 
Pfj [100]. The CS measurements are then obtained by randomly subsampling the vector y a . The compressive 
transform imager sets A and B to be noiselet bases and uses a 2-D undecimated wavelet transform to sparsify the 
image. 

V. Structure in Finite-Dimensional Models 

Until now we focused on structure in the measurement matrix <£, and considered signals x with finite-dimensional 
sparse representations. We now turn to discuss how to exploit structure beyond sparsity in the input signal in order 
to reduce the number of measurements needed to faithfully represent it. Generalizing the notion of sparsity will 
allow us to move away from finite-dimensional models extending the ideas of CS to reduce sampling rates for 
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infinite-dimensional continuous-time signals, which is the main goal of sampling theory. 

We begin our discussion in this section within the finite-dimensional context by adding structure to the non-zero 
values of x. We will then turn, in Section VI, to more general, infinite-dimensional notions of structure, that can 
be applied to a broader class of analog signals. 

A. Multiple Measurement Vectors 

Historically, the first class of structure that has been considered within the CS framework has been that of multiple 
measurement vectors (MMVs) [101] which, similarly to sparse approximation, has been an area of interest in signal 
processing for more than a decade [21]. In this setting, rather than trying to recover a single sparse vector x, the 
goal is to jointly recover a set of vectors {xi}f =l that share a common support. Stacking these vectors into the 
columns of a matrix X, there will be at most K non-zero rows in X. That is, not only is each vector if -sparse, 
but the non-zero values occur on a common location set. We use the notation $7 = supp(X) to denote the set of 
indices of the non-zero rows. 

MMV problems appear quite naturally in many different applications areas. Early work on MMV algorithms 
focused on magnetoencephalography, which is a modality for imaging the brain [21, 102]. Similar ideas were also 
developed in the context of array processing [21, 103, 104], equalization of sparse communication channels [105, 
106], and more recently cognitive radio and multiband communications [9,86, 107-109]. 

1) Conditions on measurement matrices: As in standard CS, we assume that we are given measurements {yi}f =1 
where each vector is of length M < N. Letting Y be the M x L matrix with columns yi, our problem is to recover 
X assuming a known measurement matrix <I> so that Y = <£X. Clearly, we can apply any CS method to recover 
Xi from yi as before. However, since the vectors {xi} all have a common support, we expect intuitively to improve 
the recovery ability by exploiting this joint information. In other words, we should in general be able to reduce 
the number of measurements ML needed to represent X below SL, where S is the number of measurements 
required to recover one vector x« for a given matrix <3>. 

Since = K, the rank of X satisfies rank(X) < K. When rank(X) = 1, all the sparse vectors Xi are 
multiples of each other, so that there is no advantage to their joint processing. However, when rank(X) is large, 
we expect to be able to exploit the diversity in its columns in order to benefit from joint recovery. This essential 
result is captured by the following necessary and sufficient uniqueness condition: 

Theorem 18. [110] A necessary and sufficient condition for the measurements Y = <£X to uniquely determine 
the jointly sparse matrix X is that 

l , v m „ spark($) - l + rank(X) 

|supp(X)|< . (31) 

The sufficiency result was initially shown for the case rank(X) = | supp(X)| [111]. As shown in [110, 112], 
we can replace rank(X) by rank(Y) in (31). The sufficient direction of this condition was shown in [113] to 
hold even in the case where there are infinitely many vectors Xj. A direct consequence of Theorem 18 is that 
matrices X with larger rank can be recovered from fewer measurements. Alternatively, matrices X with larger 
support can be recovered from the same number of measurements. When rank(X) = K and spark(<I>) takes on its 
largest possible value equal to M + 1, condition (31) becomes M > K + 1. Therefore, in this best-case scenario, 
only K + 1 measurements per signal are needed to ensure uniqueness. This is much lower than the value of 2K 
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obtained in standard CS via the spark (cf. Theorem 4), which we refer to here as the single measurement vector 
(SMV) setting. Furthermore, as we now show, in the noiseless setting X can be recovered by a simple algorithm, 
in contrast to the combinatorial complexity needed to solve the SMV problem from 2K measurements for general 
matrices <E>. 

2) Recovery Algorithms: A variety of algorithms have been proposed that exploit the joint sparsity in different 
ways. As in the SMV setting, two main approaches to solving MMV problems are based on convex optimization 
and greedy methods. The analogue of (10) in the MMV case is 

X = arg min ||X|| „ subject to Y = $X, (32) 

where we define the £ P:q norms for matrices as 




with x % denoting the zth row of X. With a slight abuse of notation, we also consider the quasi-norms with p = 
such that ||X||o,g = | supp(X)| for any q. Optimization based algorithms relax the £q norm in (32) and attempt to 
recover X by mixed norm minimization [17, 112, 114-117]: 

X = arg min ||X|L subject to Y = 3>X (34) 

for some p, q > 1; values of p, q = 1,2 and oo have been advocated. 

The standard greedy approaches in the SMV setting have also been extended to the MMV case [101, 110, 114, 
118-120]. The basic idea is to replace the residual vector r by a residual matrix R, which contains the residuals 
with respect to each of the measurements, and to replace the surrogate vector $ T r by the (/-norms of the rows 
of <1> T R. For example, making these changes to OMP (Algorithm 1) yields a variant known as simultaneous 
orthogonal matching pursuit, shown as Algorithm 3, where X|q denotes the restriction of X to the rows indexed 
by Q. 

An alternative MMV strategy is the ReMBo (reduce MMV and boost) algorithm [113]. ReMBo first reduces the 
problem to an SMV that preserves the sparsity pattern, and then recovers the signal support set; given the support, 
the measurements can be inverted to recover the input. The reduction is performed by merging the measurement 
columns with random coefficients. The details of the approach together with a recovery guarantee are given in the 
following theorem. 

Theorem 19. Let X be the unique K -sparse solution matrix ofY = <&X and let have spark greater than 2K. 
Let a. be a random vector with an absolutely continuous distribution and define the random vectors y = Ya and 
x = Xa. Consider the random SMV system y = $x. Then: 

1) for every realization of a, the vector x is the unique K -sparse solution of the SMV; 

2) r2(x) = ri(X) with probability one. 

According to Theorem 19, the MMV problem is first reduced to an SMV counterpart, for which the optimal 
solution x is found. We then choose the support of X to be equal to that of x, and invert the measurement vectors 
Y over this support. In practice, computationally efficient methods are used to solve the SMV counterpart, which 
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Input: CS matrix <£, MMV matrix Y 
Output: Row-sparse matrix X 
Initialize: X = 0, R = Y, ft = 0, % ■- 
while halting criterion false do 

i <— i + 1 
b(n) ' " ^ 
ft <- 
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fiUsupp(T(M)) 
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R <- Y - $Xi 
end while 
return X <— X, 







{form residual matrix £ 9 -norm vector} 

{update row support with index of residual's row with largest magnitude} 
{update signal estimate} 
{update measurement residual} 



can lead to recovery errors in the presence of noise, or when insufficient measurements are taken. By repeating 
the procedure with different choices of a, the empirical recovery rate can be boosted significantly [113], and lead 
to superior performance over alternative MMV methods. 

The MMV techniques discussed so far are rank blind, namely, they do not explicitly take the rank of X, or that 
of Y, into account. Theorem 18 highlights the role of the rank of X in the recovery guarantees. If rank(X) = K 
and (31) is satisfied, then every K columns of <3? are linearly independent. This in turn means that 1Z(Y) = 1Z($q), 
where 1Z(Y) denotes the column range of the matrix Y. We can therefore identify the support of X by determining 
the columns 4> n that lie in 1Z(Y). One way to accomplish this is by minimizing the norm of the projections onto 
the orthogonal complement of 1Z(Y): 

min||(I-i^(Y))^|| 2 , (35) 

n 

where Pn(Y) is the orthogonal projection onto the range of Y. The objective in (35) is equal to zero if and only 
if n G ft. Since, by assumption, the columns of <5q are linearly independent, once we find the support we can 
determine X as X^ = ^Y. We can therefore formalize the following guarantee. 

Theorem 20. [110] Tfrank(X) = K and (31) holds, then the algorithm (35) is guaranteed to recover X from 
Y exactly. 

In the presence of noise, we choose the K values of n for which the expression (35) is minimized. Since (35) 
leverages the rank to achieve recovery, we say that this method is rank aware. More generally, any method whose 
performance improves with increasing rank will be termed rank aware. It turns out that it is surprisingly simple to 
modify existing greedy methods, such as OMP and thresholding, to be rank aware: instead of taking inner products 
with respect to Y or the residual R, at each stage the inner products are computed with respect to an orthonormal 
basis U for the range of Y or R [110]. 

The criterion in (35) is similar in spirit to the MUSIC algorithm [121], popular in array signal processing, which 
also exploits the signal subspace properties. As we will see below in Section VI, array processing algorithms can 
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be used to treat a variety of other structured analog sampling problems. 

The MMV model can be further extended to include the case in which there are possibly infinitely many 
measurement vectors 

y(A) = $x(A), A G A, (36) 

where A indicates an appropriate index set (which can be countable or uncountable). Here again the assumption 
is that the set of vectors {x(A), A <E A} all have common support $7 of size K. Such an infinite-set of equations 
is referred to as an infinite measurement vector (IMV) system. The common, finite, support set can be exploited 
in order to recover x(A) efficiently by solving an MMV problem [113] Reduction to a finite MMV counterpart is 
performed via the continuous to finite (CTF) block, which aims at robust detection of $7. The CTF builds a frame 
(or a basis) from the measurements using 

, , n Frame construct _ / w H / w 

y(A) ► Q = 2^y( A )y ( A ) 

AeA 

DeCOmp ° Se ) Q = VV". (37) 

Typically, Q is constructed from roughly IK snapshots y(A), and the (optional) decomposition allows removal 
of the noise space [109]. Once the basis V is determined, the CTF solves an MMV system V = <I>U with 
supp(U) < K. An alternative approach based on the MUSIC algorithm was suggested in [111, 122]. 

The crux of the CTF is that the indices of the nonidentically-zero rows of the matrix U that solves the finite 
underdetermined system V = <I>U coincide with the index set $7 that is associated with the infinite signal-set 
x(A) [113], as incoporated in the follwoing theorem. 

Theorem 21. [109] Suppose that the system of equations (36) has a unique K -sparse solution set with support £1, 
and that the matrix satisfies (31). Let V be a matrix whose column span is equal to the span of {y(A), A G A}. 
Then, the linear system V = <£U has a unique K-sparse solution with row support equal fl 

Once VL is found, the IMV system reduces to y(A) = <3>nXQ(A), which can be solved simply by computing the 
pseudo-inverse x^(A) = $>Qy(A). 

3) Performance guarantees: In terms of theoretical guarantees, it can be shown that MMV extensions of SMV 
algorithms will recover X under similar conditions to the SMV setting in the worst-case scenario [17, 112, 119] 
so that theoretical equivalence results for arbitrary values of X do not predict any performance gain with joint 
sparsity. In practice, however, multichannel reconstruction techniques perform much better than recovering each 
channel individually. The reason for this discrepancy is that these results apply to all possible input signals, and are 
therefore worst-case measures. Clearly, if we input the same signal to each channel, namely when rank(X) = 1, 
no additional information on the joint support is provided from multiple measurements. However, as we have seen 
in Theorem 18, higher ranks of the input X improve the recovery ability. In particular, when rank(X) = K, 
rank-aware algorithms such as (35) recover the true value of X from the minimal number of measurements given 
in Theorem 18. This property is not shared by the other MMV methods. 

Another way to improve performance guarantees is by posing a probability distribution on X and developing 
conditions under which X is recovered with high probability [101, 118, 119, 123]. Average case analysis can be 
used to show that fewer measurements are needed in order to recover X exactly [119]. 



29 

Theorem 22. [119] Let X 6 R ArxL &e drawn from a probabilistic model in which the indices for its K nonzero 
rows are drawn uniformly at random from the (^) possibilities and its nonzero rows ( when concatenated) are 
given by SA, where £ is an arbitrary diagonal matrix and each entry of A is an i.i.d. standard Gaussian random 
variable. If K < min(Ci/ / u 2 (<I>), C 2 A^/ 1| <E> || 2 ) then X can be recovered exactly from Y = $X via (34), with p = 2 
and q = 1, with high probability. 

In a similar fashion to the SMV case (cf. Theorem 13), while worst-case results limit the sparsity level to 
K = 0(VM), average-case analysis shows that sparsity up to order K = O(M) may enable recovery with high 
probability. Moreover, under a mild condition on the sparsity and on the matrix the failure probability decays 
exponentially in the number of channels L [119]. 

4) Applications: The MMV model has found several applications in the applied CS literature. One example is in 
the context of electroencephalography and magnetoencephalography (EEG/MEG) [21, 102]. As mentioned earlier, 
sparsity-promoting inversion algorithms have been popular in EEG/MEG literature due to their ability to accurately 
localize electromagnetic source signals. It is also possible to further improve estimation performance by introducing 
temporal regularization when a sequence of EEG/MEG is available. For example, one may apply the MMV model 
on the measurements obtained over a coherent time period, effectively enforcing temporal regularity on the brain 
activity [124]. Such temporal regularization can correct estimation errors that appear as temporal "spikes" in 
EEG/MEG activity. The example in Fig. 7 shows a test MEG activation signal with three active vertices peaking at 
separate time instances. A 306-sensor acquisition configuration was simulated with SNR = 3dB. The performance 
of MEG inversion with independent recovery of each time instance exhibits spurious activity detections that are 
removed by the temporal regularization enforced by the MMV model. Additionally, the accuracy of the temporal 
behavior for each vertex is improved; see [124] for details. 

MMV models are also used during CS recovery for certain infinite-dimensional signal models [18]. We will 
discuss this application in more detail in Section VI-B. 

B. Unions of Sub spaces 

To introduce more general models of structure on the input signal, we turn now to extend the notion of sparsity 
to a much broader class of signals which can incorporate both finite-dimensional and infinite-dimensional signal 
representations. The enabling property that allows recovery of a sparse vector x £ T>k from M < N measurements 
is the fact that the set corresponds to a union of if-dimensional subspaces within R N . More specifically, if 
we know the K nonzero locations of x, then we can represent x by a ^-dimensional vector of coefficients and 
therefore only K measurements are needed in order to recover it. Therefore, for each fixed location set, x is 
restricted to a if-dimensional subspace, which we denote by Ui. Since the location set is unknown, there are (^) 
possible subspaces Ui in which x may reside. Mathematically, this means that sparse signals can be represented 
by a union of subspaces [15]: 

m 

xeU=\JUi, (38) 

i=i 

where each subspace Ui, 1 < i < m, is a If -dimensional subspace associated with a specific set of K nonzero 
values. 



(a) Ground truth 




(b) Recovery with standard sparsity 




30 msec 65 msec 72 msec 

Fig. 7. Example performance of the MMV model for EEG inversion with temporal regularization. The setup consists of a configuration of 
306 sensors recording simulated brain activity consisting of three vertices peaking at distinct times over a 120 ms period. The figures show 
(a) the ground truth EEG activity, (b) the inversion obtained by applying MMV recovery on all the data recorded by the sensors, and (c) the 
inversion obtained by independently applying sparse recovery on the data measured at each particular time instance. In each case, we show 
three representative time instances (30 ms, 65 ms, and 72 ms, respectively). The results show that MMV is able to exclude spurious activity 
detections successfully through implicit temporal regularization (taken from [124]). 



For canonically sparse signals, the union Ex is composed of canonical subspaces Ui that are aligned with K 
out of the N coordinate axes of Mr. Allowing for more general choices of Ui leads to powerful representations 
that accommodate many interesting signal priors. It is important to note that union models are not closed under 
linear operations: The sum of two signals from a union U is generally no longer in U. This nonlinear behavior of 
the signal set renders sampling and recovery more intricate. To date, there is no general methodology to treat all 
unions in a unified manner. Therefore, we focus our attention on some specific classes of union models, in order 
of complexity. 

The simplest class of unions result when the number of subspaces comprising the union is finite, and each 
subspace has finite dimensions. We call this setup a finite union of subspaces (FUS) model. Within this class we 
consider below two special cases: 

• Structured sparse supports: This class consists of sparse vectors that meet additional restrictions on the support 
(i.e., the set of indices for the vector's nonzero entries). This corresponds to only certain subspaces Ui out of 
the (^) subspaces present in T>k being allowed [20]. 4 

• Sparse sums of subspaces where each subspace U% comprising the union is a direct sum of K low-dimensional 

4 While the description of structured sparsity given via unions of subspaces is deterministic, there exists many different CS recovery 
approaches that leverage probabilistic models designed to promote structured sparsity, such as Markov random fields, hidden Markov Trees, 
etc. [125-131], as well as deterministic approaches that rely on structured sparsity-inducing norm minimization [17, 132]. 
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subspaces [17] 



Ui — Aj. 



(39) 



Here {Aj, 1 < j < m} are a given set of subspaces with dimensions dim(„4j) = dj, and \j\ = K denotes 
a sum over K indices. Thus, each subspace Ui corresponds to a different choice of K subspaces Aj that 
comprise the sum. The dimensionality of the signal representation in this case will be N = X]j=i4/' ; f° r 
simplicity, we will often let dj = d for all j so that N = dm. As we show, this model leads to block sparsity 
in which certain blocks in a vector are zero, and others are not [120, 133, 134]. This framework can model 
standard sparsity by letting Aj, j = 1, . . . , N be the one-dimensional subspace spanned by the j th canonical 
vector. 

The two FUS cases above can be combined to allow only certain sums of K subspaces to be part of the union U. 

More complicated is the setting in which the number of possibilities is still finite while each underlying subspace 
has infinite dimensions. Finally, there may be infinitely many possibilities of finite or infinite subspaces. These 
last two classes allow us to treat different families of analog signals, and will be considered in more detail in 
Section VI. 

1 ) Conditions on measurement matrices: Guarantees for signal recovery using a FUS model can be developed 
by extending tools used in the standard sparse setting. For example, the (U,5)-RIP for FUS models [15-17,20] 
is similar to the standard RIP where instead of the inequalities in (7) having to be satisfied for all sparse vectors 
x € Y,k, they have to hold only for vectors x G U. If the constant 5 is small enough, then it can be shown that 
recovery algorithms tailored to the FUS model will recover the true underlying vector x. 

An important question is how many samples are needed roughly in order to guarantee stable recovery. This 
question is addressed in the following proposition [16, 17,20,46]. 

Proposition 1 ([46, Theorem 3.3]). Consider a matrix $ of size M x N with entries independently drawn from 
a subgaussian distribution, and let U be composed of L subspaces of dimension D. Let t > and < 5 < 1 be 
constant numbers. If 



then <I> satisfies the (U, 5) -RIP with probability at least 1 — e *. 

As observed in [46], the first term in (40) has the dominant impact on the required number of measurements for 
sparse signals in an asymptotic sense. This term quantifies the amount of measurements that are needed to code 
the exact subspace where the sparse signal resides. We now specialize this result to the two FUS classes given 
earlier. 

• In the structured sparse supports case, L corresponds to the number of distinct K-sparse supports allowed by 
the constraints, with L < (^) ; this implies a reduction in the number of measurements needed as compared 
to the traditional sparsity model. Additionally, D = K, since each subspace has dimension K. 

• For the sparse sum of subspaces setting, we focus on the case where each Aj is of dimension d; we then 




(40) 
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d\ = 3 d,2 = 4 <i2 = 2 6(4 = 6 d§ = 1 

Fig. 8. A block-sparse vector 8 over! = {di, . . . , ds}. The gray areas represent 10 non-zero entries which occupy two blocks. 



have L = ( N ^ d ) ■ Using the approximation 



(N/dK) K <L= ( N ^j < (e N/dK) K , (41) 

we conclude that for a given fraction of nonzeros r = dK/N, roughly M rj K\og(N/dK) = —K\og{r) 
measurements are needed. For comparison, to satisfy the standard RIP a larger number n » — Kdlog(r) 
is required. The FUS model reduces the total number of subspaces and therefore requires d times less 
measurements to code the signal subspace. The subspace dimension D = Kd equals the number of degrees 
of freedom in x. 

Since the number of nonzeros is the same regardless of the sparsity structure, the term D = O (K) is not reduced 
in either setting. 

There also exists an extension of the coherence property to the sparse sum of subspaces case [120]. We must 
first elaborate on the notation for this setup. Given a signal x G U our goal is to recover it from measurements 
y = Ax where A is an appropriate measurement matrix of size M x N. To recover x from y we exploit the fact 
that any x E U can be represented in an appropriate basis as a block-sparse vector [135]. This follows by first 
choosing a basis tyj for each of the subspaces Aj. Then, any x G U can be expressed as 

\i\=K 

following the notation of (39), where 9[j] E H dj are the representation coefficients in ^j. Let ^ be the column 
concatenation of ^j, and denote by 8[j] the jth sub-block of a length-iV vector 9 over I = {d±, . . . , d m }. The 
jth sub-block is of length dj, and the blocks are formed sequentially so that 

3T - ro ° ° - e N ] T . (43) 



7 di • • • u N-d m +l 



6[1] 8[m] 

If for a given x the jth subspace Aj does not appear in the sum (39), then 9[j] = 0. Finally, we can write x = ^0, 
where there are at most K non-zero blocks 6 [i] . Consequently, our union model is equivalent to the model in which 
x is represented by a block-sparse vector 9. The CS measurements can then be written as y = Ax = A^6 = $0, 
where we denoted <I> = A$>. An example of a block-sparse vector with k = 2 is depicted in Fig. 8. When di = 1 
for all i, block sparsity reduces to conventional sparsity in the dictionary We will say that a vector 9 G M. N is 
block i^-sparse if 9[i] is nonzero for at most K indices i. 

The block-sparse model we present here has also been studied in the statistical literature, where the objective 
is often quite different. Examples include group selection consistency [136, 137], asymptotic prediction proper- 
ties [137, 138], and block sparsity for logistic regression [139]. Block sparsity models are also interesting in their 
own right (namely, not only as an equivalence with an underlying union), and appear naturally in several problems. 
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Examples include DNA microarray analysis [140, 141], equalization of sparse communication channels [105], and 
source localization [104]. 

We now introduce the adaptation of coherence to sparse sums of subspaces: the block-coherence of a matrix $ 
is defined as 

/ XB($) = max^($ if M$[r]) (44) 

with p(A) denoting the spectral norm of the matrix A. Here $ is represented as a concatenation of column-blocks 
$[£] of size M x d: 

$ = [01 ... (pd 4>d+l ■■■ 4>2d ■■■ <pN-d+i ■■■ <Pn]- (45) 
$[1] <S>[2] *[m] 

When d = 1, as expected, //b($) = M$)- More generally, //b($) < /•*($)• 

While a*b($) quantifies global properties of the matrix <£, local properties are characterized by the sub-coherence 
of <£, defined as 

z/(3>) = max max 1 0f </>.,■ |, 0;,0j G (46) 

We define z^($) = for d = 1. In addition, if the columns of &[£] are orthogonal for each £, then f(3>) = 0. 

2) Recovery algorithms: Like in the MMV setting, it is possible to extend standard sparsity-based signal 
recovery algorithms to the FUS model. For example, greedy algorithms may be modified easily by changing 
the thresholding T(x, K) (which finds the best approximation of x in the union of subspaces Er-) to a structured 
sparse approximation step: 

My(i) = arg min ||x — x'\\2- (47) 

For example, the CoSaMP algorithm (see Algorithm 2) is modified according to the FUS model [20] by changing 
the following two steps: 

• Prune residual: $7 «— supp(M^ 2 (e)). 

• Prune signal: «— Mu(b). 

A similar change can be made to the IHT algorithm (15) to obtain a model-based IHT variant: 

Structured sparse approximation algorithms of the form (47) are feasible and computationally efficient for a variety 
of structured sparse support models [20, 142-144]. For example, the approximation algorithm under block 
sparsity is equivalent to block thresholding, with the K blocks with the largest energies (or £2 norms) being 
preserved; we will show another example in greater detail later in this section. 

For the sparse sum of subspaces setting, it is possible to formulate optimization-based algorithms for signal 
recovery. A convex recovery method can be obtained by minimizing the sum of the energy of the blocks 0[i]. To 
write down the problem explicitly, we define the mixed £2/^1 norm as 

m 

\\e\\2,x = J2\\ e Mh- < 48 ) 

1=1 
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We may then recover 9 by solving [17, 134, 136, 137] 

9 = arg min ll^lbx subject to y = <&9. (49) 

6»eR N 

The optimization constraints can be relaxed to address the case of noisy measurements, in a way similar to the 
standard BPIC algorithm. Generalizations of greedy algorithms to the block sparse setting have been developed in 
[101,120]. 

3) Recovery guarantees: Many recovery methods for the FUS model inherit the guarantees of their standard 
counterparts. Our first example deals with the model-based CoSaMP algorithm. Since CoSaMP requires RIP of 
order 4K, here we must rely on enlarged unions of subspaces. 

Definition 6. For an integer J and a FUS model U, denote the J-sum of U as the sum of subspaces 

Sj{U) = Uj, (50) 
\i\=J 

following the notation of (39), which contains all sums of J signals belonging in U. 
We can then pose the following guarantee. 

Theorem 23. [20] Let x £U and let y = §x + n be a set of noisy CS measurements. If & has the (S^U), 5)-RIP 
with 5 < 0.1, then the signal estimate Xi obtained from iteration i of the model-based CoSaMP algorithm satisfies 

\\x - Xih < 2" l ||x|| 2 + 15||n|| 2 . (51) 



One can also show that under an additional condition on $ the algorithm is stable to signal mismodeling [20]. 
Similar guarantees exist for the model-based IHT algorithm [20]. 

Guarantees are also available for the optimization-based approach used in the sparse sum of subspaces setting. 
Theorem 24. [17] Let x G M. N and let y = <E>x + n be a set of noisy CS measurements, with ||n||2 < e. If $ 
has the (S2(U),6)-RIP with 5 < \[2 — 1, then the signal estimate x obtained from (49) with relaxed constraints 
\\y — <$>x\\2 < e satisfies 

\\x - x\\ 2 < C^-^Wx - U u (x)\\ 2 + C 2 e, (52) 
where Mj/(s) is defined in (47), and U is a sparse sum of subspaces. 

Finally, we point out that recovery guarantees can also be obtained based on block coherence. 
Theorem 25. [120] Both the greedy methods and the optimization-based approach of (49) recover a block-sparse 
vector 9 from measurements y = <&9 if the block-coherence satisfies 

Kd< U—7^+ d -( d - 1 ) J ^)- ( 53 ) 
2 \Mb($) Vb($)/ 

In the special case in which the columns of are orthonormal for each i, we have v(<&) = and therefore the 
recovery condition becomes Kd < + d)/2. Comparing to Theorem 4 for recovery of a conventional Kd- 

sparse signal we can see that, thanks to ^b(^) < A i (^ > )> making explicit use of block-sparsity leads to guaranteed 
recovery for a potentially higher sparsity level. These results can also be extended to both adversarial and random 
additive noise in the measurements [145]. 




Fig. 9. The class of tree-sparse signals is an example of a FUS that enforces structure in the sparse signal support, (a) Example of a tree- 
structured sparse support for a 1-D piecewise smooth signal's wavelet coefficients (taken from [20]). (b) Example recovery of the N = 
512 x 512 = 262144-pixeJ Peppers test image from M = 40000 random measurements (15%) using standard CoSaMP (SNR =17.45dB). 
(c) Example recovery of the same image from the same measurements using model-based CoSaMP for the tree-structured FUS model (SNR = 
22.6dB). In both cases, the Daubechies-8 wavelet basis was used as the sparsity/compressibility transform. 

4) Applications: A particularly interesting example of structured sparse supports corresponds to tree-structured 
supports [20]. For signals that are smooth or piecewise smooth, including natural images, sufficiently smooth 
wavelet bases provide sparse or compressible representations 6. Additionally, because each wavelet basis element 
acts as a discontinuity detector in a local region of the signal at a certain scale, it is possible to link together wavelet 
basis elements corresponding to a given location at neighboring scales, forming what are known as branches that 
connect at the coarsest wavelet scale. The corresponding full graph is known as a wavelet tree. For locations where 
there exist discontinuities, the corresponding branches of wavelet coefficients tend to be large, forming a connected 
subtree inside the wavelet tree. Thus, the restriction of U C includes only the subspaces corresponding to this 
type of structure in the signal's support. Figure 9(a) shows an example of a tree-sparse 1-D signal support for a 
wavelet coefficient vector. Since the number of possible supports containing this structure is limited to L < C K /K 
for a constant C, we obtain that M = O (K) random measurements are needed to recover these signals using 
model-based recovery algorithms. Additionally, there exist approximation algorithms to implement M for this FUS 
model based on both greedy and optimization-based approaches; see [20] for more details. Figure 9(b) shows an 
example of improved signal recovery from random measurements leveraging the FUS model. 

In our development so far, we did not consider any structure within the subspaces comprising the unions. 
In certain applications it may be beneficial to add internal structure. For example, the coefficient vectors 9[j] 
may themselves be sparse. Such scenarios can be accounted for by adding an t\ penalty in (49) on the individual 
blocks [146], an approach that is dubbed C-HiLasso in [147]. This allows to combine the sparsity-inducing property 
of l\ optimization at the individual feature level, with the block-sparsity property of (49) on the group level, 
obtaining a hierarchically structured sparsity pattern. An example FUS model featuring sparse sums of subspaces 
that exploits the additional structure described above is shown in Fig. 10. This example is based on identification 





Fig. 10. Example of recovered digits (3 and 5) from a mixture with 60% of missing components. From left to right: noiseless mixture, observed 
mixture with missing pixels highlighted in red, recovered digits 3 and 5, and active set recovered for 1 80 differen t mixtures using the C-HiLasso 
and (13) respectively. The last two figures show the active sets of the recovered coefficients vectors 8 as a matrix (one column per mixture), 
where black dots indicate nonzero coefficients. The coefficients corresponding to the subspace bases for digits 3 and 5 are marked as pink bands. 
Notice that C-HiLasso efficiently exploits the FUS model, succeeding in recovering the correct active groups in all the samples. The standard 
approach (13), which lacks this stronger signal model, clearly is not capable of doing so, and active sets spread all over the groups (taken 
from [147]). 



of digits; a separate subspace is trained for each digit 0, . . . , 9, and the task addressed is separation of a mixture 
of digits from subsampled information. We collect bases that span each of the 10 subspaces Aq, . . . ,Ag into a 
dictionary <£, and apply the FUS model where the subspaces Ui considered correspond to mixtures of pairs of 
digits. The FUS model allows for successful source identification and separation. 

VI. Structure in Infinite-Dimensional Models 

One of the prime goals of CS is to allow for reduced-rate sampling of analog signals. In fact, many of the 
original papers in the field state this as a motivating drive for the theory of CS. In this section we focus on 
union models that include a degree of infiniteness: This can come into play either by allowing for infinitely many 
subspaces, by letting the subspaces have infinite dimension, or both. As we will show, the resulting models may 
be used to describe a broad class of structured continuous-time signals. We will then demonstrate how such priors 
can be translated into concrete hardware solutions that allow sampling and recovery of analog signals at rates far 
below that dictated by Nyquist. 

The approach we present here to reduced-rate sampling is based on viewing analog signals in unions of subspaces, 
and is therefore fundamentally different than previous attempts to treat similar problems, [42,84, 104, 148, 149]. 
The latter typically rely on discretization of the analog input, and are often not concerned with the actual hardware. 
Furthermore, in some cases the reduced rate analog stage results in high rate DSP. In contrast, in all of the examples 
below we treat the analog formulation directly, the DSP can be performed in real time at the low rate, and the analog 
front end can be implemented in hardware using standard analog design components such as modulators, low rate 
analog-to-digital converters (ADCs) and low-pass filters (LPFs). A more detailed discussion on the comparison to 
discretized approaches can be found in [10, 150]. 

In our review, we consider three main cases: 

• finite unions of infinite dimensional spaces; 

• infinite unions of finite dimensional spaces; 

• infinite unions of infinite dimensional spaces. 
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X(t) 



t = nT 



Fig. 11. Sampling and reconstruction in shift-invariant spaces. A compressive signal acquisition scheme can be obtained by reducing the 
number of sampling filters from N top < N and replacing the filter bank G (e J " ) with a CTF block with real-time recovery (taken from [1 8]). 



In each one of the three settings above there is an element that can take on infinite values. We present general 
theory and results behind each of these cases, and focus in additional detail on a representative example application 
for each class. 

Before describing the three cases, we first briefly introduce the notion of sampling in shift-invariant (SI) 
subspaces, which plays a key role in the development of standard (subspace) sampling theory [135, 151]. We 
then discuss how to incorporate structure into SI settings, leading to the union classes outlined above. 



A. Shift-invariant spaces for analog signals 

A signal class that plays an important role in sampling theory are signals in SI spaces [151-154]. Such signals 
are characterized by a set of generators {he(t), 1 < £ < N} where in principle N can be finite or infinite (as is 
the case in Gabor or wavelet expansions of L2). Here we focus on the case in which N is finite. Any signal in 
such a SI space can be written as 

N 

x(t) = j2J2 d ^ h ^- nT ^ (54) 

£=l nSZ 

for some set of sequences {(^[n] £ ^2, 1 < £ < N} and period T. This model encompasses many signals used in 
communication and signal processing including bandlimited functions, splines [151], multiband signals [108, 109, 
155, 156] and pulse amplitude modulation signals. 

The subspace of signals described by (54) has infinite dimensions, since every signal is associated with infinitely 
many coefficients {c^[n], 1 < ^ < N}. Any such signal can be recovered from samples at a rate of N/T; one 
possible sampling paradigm at the minimal rate is given in Fig. 11. Here x(t) is filtered with a bank of N filters, 
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each with impulse response se(i) which can be almost arbitrary, and the outputs are uniformly sampled with period 
T. Denote by c(w) a vector collecting the frequency responses of C([n], 1 < t < N. The signal is then recovered 
by first processing the samples with a filter bank with frequency response G(e J ' w ), which depends on the sampling 
filters and the generators ht(t) (see [18] for details). In this way we obtain the vectors 



containing the frequency responses of the sequences de[n], 1 < I < N. Each output sequence is then modulated 
by a periodic impulse train Ylnez^^ ~ n ^ w ^ P er i°d T, followed by filtering with the corresponding analog 
filter h e (t). 

In the ensuing subsections we consider settings in which further structure is incorporated into the generic SI 
model (54). In Sections VI-B and VI-D we treat signals of the form (54) involving a small number K of generators, 
chosen from a finite or infinite set, respectively, while in Section VI-C we consider a finite-dimensional counterpart 
of (54) in which the generators are chosen from an infinite set. All of these examples lead to union of subspaces 
models for analog signals. Our goal is to exploit the available structure in order to reduce the sampling rate. 

Before presenting the more detailed applications we point out that in all the examples below the philosophy 
is similar: we develop an analog sensing stage that consists of simple hardware devices designed to spread out 
(alias) the signal prior to sampling, in such a way that the samples contain energy from all subspace components. 
The first step in the digital reconstruction stage identifies the underlying subspace structure. Recovery is then 
performed in a subspace once the parameters defining the subspace are determined. The difference between the 
examples is in how the aliasing is obtained and in the digital recovery step which identifies the subspace. This 
framework has been dubbed Xampling in [9, 10], which combines CS and sampling, emphasizing that this is an 
analog counterpart to discrete CS theory and methods. 

B. Finite union of infinite-dimensional subspaces 

In this first case, we follow the model (38-39) where U is composed of a finite number m of subspaces, and 
each subspace has infinite dimension (i.e., dj = oo for 1 < j < m). 

1 ) Analog signal model: To incorporate structure into (54) we proceed in two ways (see Section VI-D for the 
complement). In the first, we assume that only K of the N generators are active, leading to the model 



where the notation \l\ = K means a union (or sum) over at most K elements. If the K active generators are known, 
then it suffices to sample at a rate of K/T corresponding to uniform samples with period T at the output of K 
appropriate filters. However, a more difficult question is whether the rate can be reduced if we know that only K 
of the generators are active, but do not know in advance which ones. This is a special case of the model (39) where 
now each of the subspaces At is a single-generator SI subspace spanned by ht(t). For this model, it is possible to 
reduce the sampling rate to as low as 2K/T [18]. Such rate reduction is achieved by using a bank of appropriate 
sampling filters that replaces the filters se(t) in Fig. 11, followed by postprocessing via subspace reduction and 
solving an MMV problem (cf. Section V-A). We now describe the main ideas underlying this approach. 



d(w) = G(e^)c(w) 



(55) 




(56) 



\l\=Kn£Z 



39 

2) Compressive signal acquisition scheme: A block diagram of the basic architecture is very similar to the 
one given in Fig. 11. We simply change the N sampling filters t), . . . , sjv(— t) to p < N sampling filters 
wi(t), . . . , w p (t) and replace the filter bank G(e J£J ) with the CTF block with real-time recovery (cf. Section V-A2). 
The design of the filters we(t), 1 < £ < p relies on two main ingredients: 

1) a p x N matrix <1> chosen such that it solves a discrete MMV problem with sparsity K; 

2) a set of functions {s£(t), 1 < £ < N} which can be used to sample and reconstruct the entire set of generators 
{hi(t), 1 < I < N} according to the Nyquist-rate scheme of Fig. 11. 

Based on these ingredients, the compressive sampling filters wi(t) consist of linear combinations of S((t), with 
coefficients that depend on the matrix <1> through [18] 

w(w) = M*(e^ T )**G*(e^ T )h(w), (57) 

where (•)* denotes the conjugate, w(w),h(u;) concatenate the frequency responses of we(t) (1 < £ < p) and 
he{t) (1 < £ < N), respectively, and M(e ja/r ) is a p x p arbitrary invertible matrix representing the discrete-time 
Fourier transform (DTFT) of a bank of filters. Since this matrix can be chosen arbitrarily, it allows for freedom 
in selecting the sampling filters. 

3) Reconstruction algorithm: Directly manipulating the expression for the sampled sequences leads to a simple 
relationship between the measurements ye[n] and the unknown sequences d([n\ [18]: 

y[n] = *d[n], \\d[n}\\ < K, (58) 

where the vector y[n] = [yi[n],-- - ,y p [n]] T collects the measurements at t = nT and the vector d[n] = 
[di[n],--- ,d7v[n]] T collects the unknown generator coefficients for time period n. Since only K out of the N 
sequences dg[n] are identically non-zero by assumption, the vectors {d[n]} are jointly K-sparse. Equation (58) is 
valid when M(e- ?wT ) = I for all oj; otherwise, the samples must first be pre-filtered with the inverse filter bank 
to obtain (58). Therefore, by properly choosing the sampling filters, we have reduced the recovery problem to an 
IMV problem, as studied in Section V-A2. To recover d[n] we therefore rely on the CTF and then reconstruct 
x(t) by interpolating dc[n] with their generators /i^(t). 

4) Recovery guarantees: From Theorem 21, it follows that p = 2K filters are needed in order to ensure recovery 
for all possible input signals. In practice, since polynomial-time algorithms will be used to solve the equivalent 
MMV, we will need to increase p slightly beyond 2K. 

Theorem 26. [18] Consider a signal of the form (56). Let wt(t), 1 < £ < p be a set of sampling filters defined by 
(57) where $ is a size p x N matrix. Then, the signal x(t) can be recovered exactly using the scheme of Fig. 11 
as long as * allows the solution of an MMV system of size p x N with sparsity K. 

Since our approach to recovering x(t) relies on solving an MMV system, the quality of the reconstruction depends 
directly on the properties of the underlying MMV problem. Therefore, results regarding noise, mismodeling, and 
suboptimal recovery using polynomial techniques developed in the MMV context can immediately be adapted to 
this setting. 



Spectrum of x(t) 
fp B 




Spectrum of j/i [n] Spectrum of j/;< [n] 



Fig. 12. Spectrum slices ofx(t) are overlayedin the spectrum of the output sequences yi[n]. In the example, channels i andi' realize different 
linear combinations of the spectrum slices centered around lf p , lf p , lf p . For simplicity, the aliasing of the negative frequencies is not drawn 
(taken from [9]). 

5) Example application: We now describe an application of the general analog CS architecture of Fig. 11: 
sampling of multiband signals at sub-Nyquist rates. We also expand on practical alternatives to this system, which 
reduce the hardware complexity. 

The class of multiband signals models a scenario in which x(t) consists of several concurrent radio-frequency 
(RF) transmissions. A receiver that intercepts a multiband x(t) sees the typical spectral support that is depicted 
in Fig. 12. We assume that the signal contains at most iV (symmetric) frequency bands with carriers /j, each of 
maximal width B. The carriers are limited to a maximal frequency of / max - When the carrier frequencies /j are 
fixed, the resulting signal model can be described as a subspace, and standard demodulation techniques may be 
used to sample each of the bands at low rate. A more challenging scenario is when the carriers are unknown. 
This situation arises, for example, in spectrum sensing for mobile cognitive radio (CR) receivers [157], which aim 
at utilizing unused frequency regions on an opportunistic basis. 

The Nyquist rate associated with x(t) is /nyq = 2/ max , which can be quite high in modern applications - on 
the order of several GHz. On the other hand, by exploiting the multiband structure, it can be shown that a lower 
bound on the sampling rate with unknown carriers is 2NB, as incorporated in the following theorem. We refer to 
a sampling scheme that does not exploit the carrier frequencies as blind. 

Theorem 27. [108] Consider a multiband model with maximal frequency / max an d total support $7. Let D(R) 
denote the sampling density of the blind sampling set 5 R. Then, D(R) > min{f&, 2/ max }. 

In our case the support Q, is equal to 2NB. Therefore, as long as NB < / max we can potentially reduce the 
sampling rate below Nyquist. 

5 For a formal definition of these parameters, see [108]. Intuitively, D(R) describes the average sampling rate where R = {r n } are the 
sampling points. 
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Our goal now is to use the union of subspaces framework in order to develop a sampling scheme which achieves 
the lower bound of Theorem 27. To describe a multiband signal as a union of subspaces, we divide the Nyquist 
range [—/max, /max] into M = 2L + 1 consecutive, non-overlapping, slices of individual widths f p as depicted 
in Fig. 12, such that L/T > / max . Each spectrum slice represents a single bandpass subspace Ui. By choosing 
fp > B, we ensure that no more than 2N spectrum slices are active, i.e., contain signal energy [18]. The conceptual 
division to spectrum slices does not restrict the band positions; a single band can split between adjacent slices. 

One way to realize the sampling scheme of Fig. 11 is through periodic nonuniform sampling (PNS) [108]. This 
strategy corresponds to choosing 

Wi(t) = S{t-CiT mQ ), l<i<p, (59) 

where T NY q = 1/ /nyq is the Nyquist period, and using a sampling period of T = MT NY q with M > p. Here Cj 
are integers which select part of the uniform sampling grid, resulting in p uniform sequences 

yi[n] = x((nM + q)T nyq ). (60) 

The IMV model (58) that results from PNS has sequences d([n\ representing the contents of the £th bandpass 
subspace of the relevant spectrum slice [108]. The sensing matrix <E> is a partial discrete Fourier transform (DFT), 
obtained by taking only the row indices a from the full M x M DFT matrix. 

We note that PNS was utilized for multiband sampling already in classic studies, though the traditional goal 
was to approach a rate of NB samples/sec. This rate is optimal according to the Landau theorem [158], though 
achieving it for all input signals is possible only when the spectral support is known and fixed. When the carrier 
frequencies are unknown, the optimal rate is 2NB [108]. Indeed, [155, 159] utilized knowledge of the band 
positions to design a PNS grid and the required interpolation filters for reconstruction. The approaches in [156, 
160] were semi-blind: a sampler design independent of band positions combined with the reconstruction algorithm 
of [155] which requires exact support knowledge. Other techniques targeted the rate NB by imposing alternative 
constraints on the input spectrum [111, 122]. Here we demonstrate how analog CS tools [18, 113] can lead to a 
fully-blind sampling system of multiband inputs with unknown spectra at the appropriate optimal rate [108]. A 
more thorough discussion in [108] studies the differences between the analog CS method presented here based 
on [18, 108, 113] and earlier approaches. 

6) Example hardware: As shown in [108], the PNS strategy can approach the minimal rate of Theorem 27. 
However, it requires pointwise samplers that are capable of accommodating the wide bandwidth of the input. This 
necessitates a high bandwidth track and hold device, which is difficult to build at high frequencies. An alternative 
architecture referred to as the modulated wideband converter (MWC) was developed in [109]. The MWC replaces 
the need for a high bandwidth track and hold device by using high bandwidth modulators, which are easier to 
manufacture. Indeed, off-the-shelf modulators at rates of tens of GHz are readily available. 

The MWC combines the spectrum slices dg[n] according to the scheme depicted in Fig. 13. Its design is modular 
so that when the carrier frequencies are known the same receiver can be used with fewer channels or lower sampling 
rate. Furthermore, by increasing the number of channels or the rate on each channel the same realization can be 
used for sampling full band signals at the Nyquist rate. 

The MWC consists of an analog front-end with p channels. In the ith channel, the input signal x(t) is multiplied 
by a periodic waveform pi(t) with period T, lowpass filtered by a generic filter with impulse response h(t) with 
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Fig. 13. Block diagram of the modulated wideband converter. The input passes through p parallel branches, where it is mixed with a set of 
periodic functions Pi(t), lowpass filtered and sampled at a low rate (taken from [9]). 



cutoff 1/2T, and then sampled at rate f s = 1/T. The mixing operation scrambles the spectrum of x(t), so that 
as before, the spectrum is conceptually divided into slices of width 1/T, and a weighted-sum of these slices is 
shifted to the origin [109]. The lowpass filter h(t) transfers only the narrowband frequencies up to f s /2 from that 
mixture to the output sequence yi[n]. The output has the aliasing pattern illustrated in Fig. 12. Sensing with the 
MWC leads to a matrix $ whose entries are the Fourier expansion coefficients cu of the periodic sequences Pi(t). 

The MWC can operate with as few as p = 2N channels and with a sampling rate f s = h > B on each channel, 
so that it approaches the minimal rate of 2NB. Advanced configurations enable additional hardware savings by 
collapsing the number of branches p by a factor of q at the expense of increasing the sampling rate of each channel 
by the same factor [109]. The choice of periodic functions Pi(t) is flexible: The highest Dirac frequency needs 
to exceed /max- In principle, any periodic function with high-speed transitions within the period T can satisfy 
this requirement. One possible choice for Pi(t) is a sign-alternating function, with M = 2L + 1 sign intervals 
within the period T [109, 161]. Imperfect sign alternations are allowed as long as periodicity is maintained [9]. 
This property is crucial since precise sign alternations at high speeds are extremely difficult to maintain, whereas 
simple hardware wirings ensure that pi(t) = p%(t + T p ) for every t £ 1. The waveforms Pi(t) need low mutual 
correlation in order to capture different mixtures of the spectrum. Popular binary patterns, e.g., the Gold or Kasami 
sequences, are especially suitable for the MWC [161]. Another important practical design aspect is that the lowpass 
filter h(t) does not have to be ideal. A nonflat frequency response can be compensated for in the digital domain, 
using the algorithm developed in [162]. 

The MWC has been implemented as a board-level hardware prototype [9]. 6 The hardware specifications cover 

6 A video of experiments and additional documentation for the MWC hardware are available at http://webee.technion.ac.il/Sites/People/ 
YoninaEldar/hardware.html. A graphical package demonstrating the MWC numerically is available at http://webee.technion.ac.il/Sites/People/ 
YoninaEldar/software_det3.html. 
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inputs with a 2 GHz Nyquist rate with spectrum occupation NB = 120 MHz. The total sampling rate is 280 MHz, 
far below the 2 GHz Nyquist rate. In order to save analog components, the hardware realization incorporates the 
advanced configuration of the MWC [109] with a collapsing factor q = 3. In addition, a single shift-register 
provides a basic periodic pattern, from which p periodic waveforms are derived using delays, that is, by tapping 
p different locations of the register. A nice feature of the recovery stage is that it interfaces seamlessly with 
standard DSPs by providing (samples of) the narrowband information signals. This capability is provided by a 
digital algorithm that is developed in [10]. 7 

The MWC board is a first hardware example of the use of ideas borrowed from CS for sub-Nyquist sampling 
and low-rate recovery of wideband signals where the sampling rate is directly proportional to the actual band- 
width occupation and not the highest frequency. Existing implementations of the random demodulator (RD) (cf. 
Section IV-B3) recover signals at effective sampling rates below 1 MHz, falling outside of the class of wideband 
samplers. Additionally, the signal representations used by the RD have size proportional to the Nyquist frequency, 
leading to recovery problems that are much larger than those posed by the MWC. See [9, 10] for additional 
information on the similarities and differences between the MWC, the RD, and other comparable architectures. 

C. Infinite union of finite-dimensional subspaces 

The second union class we consider is when U is composed of an infinite number m of subspaces, and each 
subspace has finite dimension. 

1 ) Analog signal model: As we have seen in Section VI-A, the SI model (54) is a convenient way to describe 
analog signals in infinite-dimensional spaces. We can use a similar approach to describe analog signals that 
lie within finite-dimensional spaces by restricting the number of unknown gains ae[n] to be finite. In order to 
incorporate structure into this model, we assume that each generator he(t) has an unknown parameter 0£ associated 
with it, which can take on values in a continuous interval, resulting in the model 

L 

x(t) = J2 a Mt,0e)- (61) 

i=i 

Each possible choice of the set {9e} leads to a different L-dimensional subspace of signals Ui, spanned by the 
functions {h(t, 9?)}. Since 9g can take on any value in a given interval, the model (61) corresponds to an infinite 
union of finite dimensional subspaces (i.e., m = oo). 

An important example of (61) is when hi(t, 9e) = h(t — t() for some unknown time delay 0\ = te, leading to a 
stream of pulses 

L 

x(t) = Y j a e h(t-t e ). (62) 
l=\ 

Here h(t) is a known pulse shape and {te,ae}f =1 , te € [0,r), ag € C are unknown delays and amplitudes. This 
model was first introduced and studied by Vetterli et al. [6-8] as a special case of signals having a finite number 
of degrees of freedom per unit time, termed finite rate of innovation (FRI) signals. 



7 The algorithm is available online at http://webee.technion.ac.il/Sites/People/YoninaEldar/software_det4.html. 
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Our goal is to sample x(t) and reconstruct it from a minimal number of samples. The primary interest is in 
pulses which have small time-support, and therefore the required Nyquist rate would be very high. However, since 
the pulse shape is known, the signal has only 2L degrees of freedom, and therefore, we expect the minimal number 
of samples to be 2L, much lower than the number of samples resulting from Nyquist rate sampling. 

A simpler version of the problem is when the signal x(t) of (62) is repeated periodically leading to the model 

L 

x(t) = aph(t — tp — mr), (63) 

m£Z 1=1 

where r is the known period. This periodic setup is easier to treat because we can exploit the properties of the 
Fourier series representation of x(t) due to the periodicity. The dimensionality and number of subspaces included 
in the model (38) remain unchanged. 

2) Compressive signal acquisition: To date, there are no general acquisition methods for signals of the form 
(61). Instead, we focus on the special case of (63). 

Our sampling scheme follows the ideas of [6-8] and consists of a filter s(t) followed by uniform sampling of 
the output with period T = r/N where N is the number of samples in one period, and N > 2L. The resulting 
samples can be written as inner products c[n] = (s(t — nT),x{t)). The following theorem establishes properties 
of the filter s(t) that allow recovery from 2L samples. 

Theorem 28. [ 163 ] Consider the t -periodic stream of pulses of order L given by ( 63 ). Choose a set K, of 
consecutive indices for which Hi^lirk/r) 7^ where H(uo) is the Fourier transform of the pulse h(t). Then the 
samples c[n] for n = 0, . . . , TV — 1, uniquely determine the signal x(t) as long as N > |/C| > 2L for any s(t) 
satisfying 

{0 u = 2vr k/r, k $ K 

nonzero uj = 2irk/T, k G K (64) 
arbitrary otherwise. 

Theorem 28 was initially focused on low pass niters in [6], and was later extended to arbitrary filters in [163]. 

To see how we can recover x(t) from the samples of Theorem 28, we note that the Fourier series coefficients 
X[k] of the periodic pulse stream x(t) are given by [6]: 

1 L 
X[k] = -H(2irk/ T )^a e e- j27Tkte/T 

T l=i 
i L 

= -H(2nk/T) Va<u{, (65) 

ti 

where up = e~ j27rtf / T . Given X[k], the problem of retrieving ap and tp in (65) is a standard problem in array 
processing [6, 164], and can be solved using methods developed in that context such as the matrix pencil [165], 
subspace-based estimators [166, 167], and the annihilating filter [8]. These methods require 2L Fourier coefficients 
to determine ap and up. In the next subsection, we show that the vector X of Fourier coefficients X[k] can be 
computed from the N samples c[n] of Theorem 28. 
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Fig. 14. Extended sampling scheme using modulating waveforms (taken from [170]). 



Since the LPF has infinite time support, the approach of (64) cannot work with time-limited signals, such as 
those of the form (62). A class of filters satisfying (64) that have finite time support are Sum of Sines (SoS) [163], 
which are given in the Fourier domain by 



G{u) = 



> bk sine 
27r keK 



2vr/r 



A: 



(66) 



where ^ 0, k G /C. Switching to the time domain 



= rect ( - 



(67) 



£ & fe e J ' 27rfet/r , 

which is clearly a time compact filter with support r. For the special case in which JC = {— p, . . . ,p} and b^ = 1, 



S(t) = rect (j-^j ej2wkt/T = rect D p (2irt/r), 



where -D p (i) denotes the Dirichlet kernel. 

Alternative approaches to sample finite pulse streams of the form (63) rely on the use of splines [7]; this enables 
obtaining moments of the signal rather than its Fourier coefficients. The moments are then processed in a similar 
fashion (see the next subsection for details). However, this approach is unstable for high values of L [7]. In contrast, 
the SoS class can be used for stable reconstruction even for very high values of L, e.g., L = 100. 

Multichannel schemes can also be used to sample pulse streams. This approach was first considered for Dirac 
streams, where a successive chain of integrators allows obtaining moments of the signal [168]. Unfortunately, the 
method is highly sensitive to noise. A simple sampling and reconstruction scheme consisting of two channels, each 
with an RC circuit, was presented in [169] for the special case where there is no more than one Dirac per sampling 
period. A more general multichannel architecture that can treat a broader class of pulses, while being much more 
stable, is depicted in Fig. 14 [170]. The system is very similar to the MWC presented in the previous section. By 
correct choice of the mixing coefficients, the Fourier coefficients X[k] may be extracted from the samples by a 
simple matrix inversion. 

3) Recovery algorithms: In both the single-channel and multichannel approaches, recovery of the unknown 
delays and amplitudes proceeds in two steps. First, the vector of samples c is related to the Fourier coefficient 
vector x through apx |/C| mixing matrix Q, as c = Qx. Here p > 2L represents the number of samples. When 
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using the SoS approach with a filter S(oj), Q = VS where S is a p x p diagonal matrix with diagonal elements 
S*(—2tt£/t), 1 < I < p, and V is a p x |/C| Vandermonde matrix with I th element given by e 3 2niT / T , 1 < £ < p, 
where T denotes the sampling period. For the multichannel architecture of Fig. 14, Q consists of the modulation 
coefficients s^. The Fourier coefficient vector x can be obtained from the samples as x = Q^c. 

The unknown parameters {tua{\f =1 are recovered from x using, e.g., the annihilating filter method. The 
annihilating filter {r[fc]}^ =0 is defined by its z-transform 

L L 

R{z) = r[k]z- k = i[(l- uez- 1 ). (68) 

k=0 1=1 

That is, the roots of R(z) equal the values through which the delays te can be found. It then follows that 

L L L 

r[k] * X[k] = ^ r[l]X[k - I] = Yl r ^ u Y l 
i=o i=o e=i 



= J2a e u k e J2r[l]u e l = 0, 



q l = 0, (69) 

1=1 1=0 

where the last equality is due to R(ue) = 0. Assuming without loss of generality that R[0] = 1, the identity in 
(69) can be written in matrix/vector form as 

/ X[-l] ... X[-L] \ ( r[l] \ / X[0] \ 



X[0] 



X[-L + l] 



'[2] 



X[l] 



\X[L-2] ... X[-l] J V r[L] J \X[L-l) J 

Thus, we only need 2L consecutive values of X[k] to determine the annihilating filter. Once the filter is found, 
the values tg are retrieved from the zeros U£ of the z-transform in (68). Finally, the Fourier coefficients ai are 
computed using (65). For example, if we have the coefficients X[k], < k < 2L — 1, then (65) may be written as 
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H(2tt/t) 
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X[2L-1] 
\ H(2tt(2L-1)/t) ) 



Since this Vandermonde matrix is left-invertible, the values an can be computed by matrix inversion. 

Reconstruction results for the sampling scheme based on the SoS filter with = 1 are depicted in Fig. 15. 
The original signal consists of L = 5 Gaussian pulses, and N = 11 samples were used for reconstruction. The 
reconstruction is exact to numerical precision. A comparison of the performance of various methods in the presence 
of noise is depicted in Fig. 15 for a finite stream consisting of 3 and 5 pulses. The pulse-shape is a Dirac delta, 
and white gaussian noise is added to the samples with a proper level in order to reach the desired SNR for all 
methods. All approaches operate using 2L + 1 samples. 
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Fig. 15. Performance comparison of finite pulse stream recovery using Gaussian [6], B-spline, E-spline [7], and SoS sampling kernels, (a) 
Reconstructed signal using SoS filters vs. original one. The reconstruction is exact to numerical precision, (b) L — 3 dirac pulses are present, 
(c) L — 5 pulses (taken from [163]). 



4) Applications: As an example application of FRI, we consider multiple image registration for superresolution 
imaging. A superresolution algorithm aims at creating a single detailed image, called a super-resolved image (SR), 
from a set of low-resolution input images of the same scene. If different images from the same scene are taken 
such that their relative shifts are not integer multiples of the pixel size, then sub-pixel information exists among 
the set. This allows to obtain a higher resolution accuracy of the scene once the images are properly registered. 

Image registration involves any group of transformations that removes the disparity between two low resolution 
(LR) images. This is followed by image fusion, which blends the properly aligned LR images into a higher 
resolution output, possibly removing blur and noise introduced by the system [171]. The registration step is 
crucial is order to obtain a good quality SR image. The theory of FRI can be extended to provide superresolution 
imaging. The key idea of this approach is that, using a proper model for the point spread function (PSF) of the 
scene acquisition system, it is possible to retrieve the underlying continuous geometric moments of the irradiance 
light-field. From this information, and assuming the disparity between any two images can be characterized by a 
global affine transformation, the set of images may be registered. The parameters obtained via FRI correspond to 
the shift vectors that register the different images before image fusion. Figure 16 shows a real-world superresolution 
example in which 40 low-resolution images allow an improvement in image resolution by a factor of 8. 

A second example application of the FRI model is ultrasonic imaging. In this imaging modality, an ultrasonic 
pulse is transmitted into a tissue, e.g., the heart, and a map of the underlying tissues is created by locating the echoes 
of the pulse. Correct location of the tissues and their edges is crucial for medical diagnosis. Current technology 
uses high rate sampling and processing in order to construct the image, demanding high computational complexity. 
Noting that the received signal is a stream of delayed and weighted versions of the known transmitted pulse shape, 
FRI sampling schemes can be exploited in order to reduce the sampling rate and the subsequent processing rates 
by several orders of magnitude while still locating the echoes. The received ultrasonic signal is modeled as a finite 
FRI problem, with a Gaussian pulse-shape. 

In Fig. 17 we consider a signal obtained using a phantom consisting of uniformly spaced pins, mimicking point 
scatterers, which is scanned by GE Healthcare's Vivid-i portable ultrasound imaging system. The data recorded by 
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Fig. 16. Example of FRI-based image superresolution. 40 images of a target scene were acquired with a digital camera, (a) Example acquired 
image, (b) Region of interest (128x128 pixels) used for superresolution. (c) Superresolved image of size 1024 x 1024 pixels (SR factor = 8). 
The PSF in this case is modeled by a B-spline of order 7 (scale 1 ). (d) Superresolved image of size 1024 x 1024 pixels (SR factor =8). The 
PSFin this case is modeled by a B-spline of order 3 (scale 2) (taken from [171]). 
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Fig. 17. Applying the SoS sampling scheme with bk = 1 on real ultrasound imaging data. Results are shown vs. original signal which uses 4160 
samples, (a) Recorded ultrasound imaging signal. The data was acquired by GE healthcare's Vivid-i ultrasound imaging system. Reconstructed 
signal (b) using TV = 17 samples (c) using N — 33 samples (taken from [163]). 



a single element in the probe is modeled as a ID stream of pulses. The recorded signal is depicted in Fig. 17(a). 
The reconstruction results using the SoS filter with bk = 1 are depicted in Fig. 17(b-c). The algorithm looked for 
the L = 4 strongest echoes, using N = 17 and TV = 33 samples. In both simulations, the estimation error in the 
location of the pulses is around 0.1 mm. These ideas have also been extended recently to allow for 2D imaging 
with multiple received elements [172]. 

D. Infinite Union of Infinite-Dimensional Subspaces 

Extending the finite-dimensional model of Section VI-C to the SI setting (54), we now incorporate structure by 
assuming that each generator he(t) has an unknown parameter 9e associated with it, leading to an infinite union 
of infinite-dimensional spaces. As with its finite counterpart, there is currently no general sampling framework 
available to treat such signals. Instead, we focus on the case in which hi(t) = h(t — Tg). 

I) Analog signal model: Suppose that 

x(t) = ^2aih(t-t e ), t e , ai £R, (70) 
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Fig. 18. Sampling and reconstruction scheme for signals of the form (71). 

where there are no more than L pulses in an interval of length T, and that the pulses do not overlap interval 
boundaries. In this case, the signal parameters in each interval can be treated separately, using the schemes of the 
previous section. In particular, we can use the sampling system of Fig. 14 where now the integral is obtained over 
intervals of length T [170]. This requires obtaining a sample from each of the channels once every T seconds, 
and using p > 2K channels, resulting in samples taken at the minimal possible rate. 

We next turn to treat the more complicated scenario in which h(t) may have support larger than T. This setting 
can no longer be treated by considering independent problems over periods of T. To simplify, we consider the 
special case in which the time delays in (70) repeat periodically (but not the amplitudes) [19, 150]. As we will show 
in this special case, efficient sampling and recovery is possible even using a single filter, and without requiring 
the pulse h(t) to be time limited. Under our assumptions, the input signal can be written as 

L 

x(t) = ^2^2a e [n]h(t-t e -nT), (71) 

«ez 1=1 

where r = {te}f =1 is a set of unknown time delays contained in the time interval [0, T), {a£[n]} are arbitrary 
bounded energy sequences, and h(t) is a known pulse shape. 

2) Compressive signal acquisition: We follow a similar approach to that in Section VI-B, which treats a 
structured SI setting where there are N possible generators. The difference is that in this current case there are 
infinitely many possibilities. Therefore, we replace the CTF in Fig. 11 with a block that supports this continuity: 
we will see that the ESPRIT method essentially replaces the CTF block [173]. 

A sampling and reconstruction scheme for signals of the form (71) is depicted in Fig. 18 [19]. The analog 
sampling stage is comprised of p parallel sampling channels. In each channel, the input signal x(t) is filtered by 
a band-limited sampling kernel s|(— t) with frequency support contained in an interval of width 2-irp/T, followed 
by a uniform sampler operating at a rate of 1/T, thus providing the sampling sequence C([n\. Note that just as in 
the MWC (Section VI-B6), the sampling filters can be collapsed to a single filter whose output is sampled at p 
times the rate of a single channel. The role of the sampling kernels is to spread out the energy of the signal in 
time, prior to low rate sampling. 

3) Recovery algorithms: To recover the signal from the samples, a properly designed digital filter correction 
bank, whose frequency response in the DTFT domain is given by W~ 1 (e^ ujT ), is applied to the sampling sequences 
in a manner similar to (55). The matrix W(e- ?tjT ) depends on the choice of the sampling kernels s|(— t) and the 
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Algorithm 4 ESPRIT Algorithm 

Input: Signal d, number of parameters L. 
Output: Delays r = {t±, . . . , t^}. 

R-dd = 2~2nez ^ M ^ [ n ] {construct correlation matrix} 
(E,cr) <- SVD(R dd ) {calculate SVD} 

E s <— L singular vectors of associated with non-zero singular values. 
$ = E^E st {compute ESPRIT matrix} 
(Ai, . . . , Xl} eig(3>) {obtain eigenvalues} 

ti < J^arg (Ai), 1 < i < L {map eigenvalues to delays} 

return r = {t±, . . . , ti} 



pulse shape h(t). Its entries are defined for 1 < £, m < p as 

W i eJUlT )i, m = ^> + 2™/T)H(u + 2nm/T). (72) 

After the digital correction stage, it can be shown that the corrected sample vector d[n] is related to the unknown 
amplitude vector a[n] by a Vandermonde matrix which depends on the unknown delays [19]. Therefore, we can 
now exploit known tools taken from the direction of arrival [174] and spectral estimation [164] literature to recover 
the delays r = {t\, . . . ,<£,}, such as the well-known ESPRIT algorithm [173]. Once the delays are determined, 
additional filtering operations are applied on the samples to recover the sequences at\n\. In particular, referring to 
Fig. 18, the matrix D is a diagonal matrix with diagonal elements equal to e~i wtk , and N(r) is a Vandermonde 
matrix with elements e -J 27rm *fe/ T . The ESPRIT algorithm is summarized for our setting as Algorithm 4, where 
E s | and E s f denote the sub matrices extracted from E s by deleting its last/first row, respectively. 

4) Recovery guarantees: The proposed algorithm is guaranteed to recover the unknown delays and amplitudes 
as long as p is large enough [19]. 

Theorem 29. [19] The sampling scheme of Fig. 18 is guaranteed to recover any signal of the form (71) as long 
as p > 2L — 77 + 1, where rj is the dimension of the minimal subspace containing the vector set {d[n], n 6 Z}. In 
addition, the filters se(t) are supported on 2irp/T and must be chosen such that W(e JwT ) of (72) is invertible. 

The sampling rate resulting from Theorem 29 is no larger than 2L/T since < rj < L. For certain signals, the 
rate can be reduced to (L + l)/T. This is due to the fact that the outputs are processed jointly, similar to the MMV 
model. Evidently, the minimal sampling rate is not related to the Nyquist rate of the pulse h(t). Therefore, for 
wideband pulse shapes, the reduction in rate can be quite substantial. As an example, consider the setup in [175], 
used for characterization of ultra-wide band wireless indoor channels. Under this setup, pulses with bandwidth of 
W = 1GHz are transmitted at a rate of l/T = 2MHz. Assuming that there are 10 significant multipath components, 
we can reduce the sampling rate down to 40MHz compared with the 2GHz Nyquist rate. 

5) Applications: Problems of the form (71) appear in a variety of different settings. For example, the model 
(71) can describe multipath medium identification problems, which arise in applications such as radar [176], 
underwater acoustics [177], wireless communications, and more. In this context, pulses with known shape are 
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transmitted through a multipath medium, which consists of several propagation paths, at a constant rate. As a 
result the received signal is composed of delayed and weighted replicas of the transmitted pulses. The delays t(, 
represent the propagation delays of each path, while the sequences ai[n\ describe the time-varying gain coefficient 
of each multipath component. 

Another important application of (71) is in the context of radar. In this example, we translate the rate reduction 
to increased resolution with a fixed time-bandwidth product (TBP), thus enabling super-resolution radar from low 
rate samples. In radar, the goal is to identify the range and velocity of K targets. The delay in this case captures the 
range while the time varying coefficients are a result of the Doppler delay related to the target velocity [150]. More 
specifically, we assume that several targets can have the same delays but possibly different Doppler shifts so that 
{te}f =1 denote the set of distinct delays. For each delay value ti there are Kg values of associated Doppler shifts 
Vik and reflection coefficients a^. We also assume that the system is highly underspread, namely v m&y T <C 1, 
where f max and T denote the maximal Doppler shift and delay. To identify the targets we transmit the signal 

N-l 

XT = Y^ x ^ - nT )> ( 73 ) 

n=0 

where x n is a known A^-length probing sequence, and h(t) is a known pulse shape. The received signal can then 
be described in the form (71), where the sequences ae[n] satisfy 

a e [n]=x n Y J W j27TI/eknT . (74) 
fc=i 

The delays and the sequences af [n] can be recovered using the general scheme for time delay recovery. The 
Doppler shifts and reflection coefficients are then determined from the sequences ac[n] using standard spectral 
estimation tools [164]. The targets can be exactly identified as long as the bandwidth W of the transmitted pulse 
satisfies W > and the length of the probing sequence satisfies N > 2maxiQ [150]. This leads to a minimal 
TBP of the input signal of WT > 87T.LmaxiQ, which is much lower than that obtained using standard radar 
processing techniques, such as matched-filtering (MF). 

An example of the identification of nine close targets is illustrated in Fig. 19(a). The sampling filter used is a 
simple LPF The original and recovered targets are shown on the Doppler-delay plane. Evidently all the targets 
were correctly identified. The result obtained by MF, with the same TBP, is shown in Fig. 19(b). Clearly, the 
compressive method has superior resolution than the standard MF in this low noise setting. Thus, the union of 
subspaces viewpoint not only offers a reduced-rate sampling method, but allows to increase the resolution in target 
identification for a fixed low TBP when the SNR is high enough, which is of paramount importance in many 
practical radar problems. 

Previous approaches for identifying the unknown delays and gains involve sampling the received signal at the 
Nyquist rate of the pulse h(t) [178-180]. However, prior knowledge of the pulse shape results in a parametric 
viewpoint, and we would expect that the rate should be proportional to the number of degrees of freedom, i.e., 
the number of paths L. 

Another approach is to quantize the delay-Doppler plane by assuming the delays and Doppler shifts lie on a 
grid [42, 104, 148, 149]. After discretization, CS tools for finite representations can be used to capture the sparsity on 
the discretized grid. Clearly, this approach has an inherent resolution limitation. Moreover, in real world scenarios, 
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(a) (b) (c) 

Fig. 19. Comparison between the target-detection performance for the case of nine targets (represented by *) in the delay-Doppler space with 
Tmax = 10 ps, v max = 10 kHz, W = 1.2 MHz, and T — 0.48 ms. The probing sequence {x n } corresponds to a random binary (±1) 
sequence with N — 48, the pulse p(t) is designed to have a nearly-flat frequency response and the pulse repetition interval is T = 10 u,s. 
Recovery of the Doppler-delay plane using (a) a union of subspaces approach, (b) a standard matched filter, and (c) a discretized delay-Doppler 
plane (taken from [150]). 



the targets do not lie exactly on the grid points. This causes a leakage of their energies in the quantized space into 
adjacent grid points [36,87]. In contrast, the union of subspaces model avoids the discretization issue and offers 
concrete sampling methods that can be implemented efficiently in hardware (when the SNR is not too poor). 

E. Discussion 

Before concluding, we point out that much of the work in this section (Sections VI-B, VI-C and VI-D) has been 
focused on infinite unions. In each case, the approach we introduced is based on a sampling mechanism using 
analog filters, followed by standard array processing tools for recovery in the digital domain. These techniques 
allow perfect recovery when no noise is present, and generally degrade gracefully in the presence of noise, as 
has been analyzed extensively in the array processing literature. Nonetheless, when the SNR is poor, alternative 
digital recovery methods based on CS may be preferable. Thus, we may increase robustness by using the analog 
sampling techniques proposed here in combination with standard CS algorithms for recovery in the digital domain. 
To apply this approach, once we have obtained the desired samples, we can view them as the measurement vector 
y in a standard CS system; the CS matrix $ is now given by a discretized Vandermonde matrix that captures all 
possible frequencies to a desired accuracy; and the vector a; is a sparse vector with non-zero elements only in 
those indices corresponding to actual frequencies [42,84, 104, 148, 149]. In this way, we combine the benefits of 
CS with those of analog sampling, without requiring discretization in the sampling stage. The discretization now 
only appears in the digital domain during recovery and not in the sampling mechanism. If we follow this strategy, 
then the results developed in Section III regarding recovery in the presence of noise are relevant here as well. 

VII. Conclusions 

In this review, our aim was to summarize applications of the basic CS framework that integrate specific 
constraints of the problem at hand into the theoretical and algorithmic formulation of signal recovery, going 
beyond the original "random measurement/sparsity model" paradigm. Due to constraints given by the relevant 
sensing devices, the classes of measurement matrices available to us are limited. Similarly, some applications focus 
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on signals that exhibit structure which cannot be captured by sparsity alone and provide additional information 
that can be leveraged during signal recovery. We also considered the transition between continuous and discrete 
representations bridged by analog-to-digital converters. Analog signals are continuous-time by nature; in many 
cases, the application of compressed sensing to this larger signal class requires the formulation of new devices, 
signal models, and recovery algorithms. It also necessitates a deeper understanding of hardware considerations that 
must be taken into account in theoretical developments. This acquisition framework, in turn, motivates the design 
of new sampling schemes and devices that provide the information required for signal recovery in the smallest 
possible representation. 

While it is difficult to cover all of the developments in compressive sensing theory and hardware [66-68, 181- 
184], our aim here was to select few examples that were representative of wider classes of problems and that 
offered a balance between a useful theoretical background and varied applications. We hope that this summary 
will be useful to practitioners in signal acquisition and processing that are interested in leveraging the features of 
compressive sensing in their specific applications. We also hope that this review will inspire further developments 
in the theory and practice underlying CS: we particularly envision extending the existing framework to broader 
signals sets and inspiring new implementation and design paradigms. With an eye to the future, more advanced 
configurations of CS can play a key role in many new frontiers. Some examples already mentioned throughout 
are cognitive radio, optical systems, medical devices such as MRI, ultrasound and more. These techniques hold 
promise for a complete rethinking of many acquisition systems and stretch the limit of current sensing capabilities. 

As we have also demonstrated, CS holds promise for increasing resolution by exploiting signal structure. This 
can revolutionize many applications such as radar and microscopy by making efficient use of the available degrees 
of freedom in these settings. Consumer electronics, microscopy, civilian and military surveillance, medical imaging, 
radar and many other rely on ADCs and are resolution-limited. Removing the Nyquist barrier in these applications 
and increasing resolution can improve the user experience, increase data transfer, improve imaging quality and 
reduce exposure time - in other words, make a prominent impact on the analog-digital world surrounding us. 
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